DEV Community

zvone187
I built an autonomous dev team with 3 AI agents that takes a Linear ticket all the way to a pull request

Ever since OpenClaw launched, I've been obsessed with one question: what does software development actually look like when AI agents can collaborate with each other? Not as autocomplete. Not as chat assistants. As actual teammates that coordinate work between them. Here's what my desktop looks like now:

My current dev workflow on Slack

I've been a developer for over 15 years, and I've seen a lot of hype around AI coding tools. Most of them promise the world -- "just describe your app and we'll build it." But in reality, the moment you go beyond a simple demo, things break down. From my experience building GPT Pilot, I learned that the biggest problem isn't that AI can't write code -- it's that a single agent gets stuck in one way of thinking and can't course-correct.

So I took a different approach. Instead of one super-agent that does everything, I built a system of three agents that work like a real dev team: a Tech Lead, a Developer, and a QA. They coordinate through Slack, just like human teammates would.

In this post, I'll walk you through the entire setup -- how each agent works, how they hand off work to each other, why multiple agents outperform a single one, and how you can import all of the skills yourself. It's all open-sourced here.

Transparency note: I'm a founder of Pazi, which is built on OpenClaw. Everything I'm sharing here works on both.

How the workflow starts

Workflow kickoff

It all begins with a Linear ticket (full example screenshot). When I move a ticket to "To Do" and assign it to the Tech Lead agent, it picks it up automatically and starts a Slack thread. This thread becomes the central coordination hub for the entire ticket -- every handoff, every status update, everything happens here.

I can also kick things off directly from Slack if I prefer. Both entry points work.

The first thing the Tech Lead does might sound boring, but it's actually crucial: it sets up the environment.

Parallel environments with worktrees

Each worktree is hosted on a separate port so I can test any implementation myself

What was really important for me was enabling parallel work. If you have multiple tickets being implemented at the same time, each one needs its own isolated environment. Otherwise, you're constantly dealing with port conflicts, branch collisions, and general chaos.

So here's what happens: for each ticket, the agent creates new git worktrees for both of our repositories (we have a separate frontend and backend). Each worktree gets its own set of ports. The agent spins up the application on those ports and configures Nginx to expose the frontend so I can access it from a browser.

This matters a lot because at the end of the process, a human always needs to test the result. Having each ticket running on its own port means I can just click a link and see the implementation -- no need to check out branches, install dependencies, or figure out which port to use.
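To make the idea concrete, the port assignment can be pictured as a pure function of the ticket id. This is a hypothetical scheme for illustration -- the base ports, the `PZ-` id format, and the function name are my assumptions, not the agents' exact setup:

```python
import re

def ticket_ports(ticket_id, fe_base=4000, be_base=5000):
    """Map a ticket id to a stable (frontend, backend) port pair.

    Illustrative only: the digits in the id pick an offset, so each
    ticket's worktrees always come up on the same two ports and two
    different tickets (mod 1000) never collide.
    """
    num = int(re.sub(r"\D", "", ticket_id)) % 1000
    return fe_base + num, be_base + num

fe, be = ticket_ports("PZ-142")
print(fe, be)  # -> 4142 5142
# The agent would then run, per repo, something like:
#   git worktree add ../worktrees/PZ-142/frontend -b feature/PZ-142
# and add an Nginx server block proxying to the frontend port.
```

Because the mapping is deterministic, the link posted in the Slack thread keeps working across re-deploys of the same ticket.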

QA creates the testing plan BEFORE any code is written

QA report

Once the environment is set up, the Tech Lead tags the QA agent in Slack: "Hey, can you create a testing plan for this ticket?"

This is a deliberate design choice and, I think, one of the most important ones. Normally, when you let an AI agent just start coding, it writes the code and then tries to test the code it wrote. The problem? It tests its own implementation rather than testing from the user's perspective. It knows what it coded, so it writes tests that confirm what it coded -- not tests that verify the feature actually works for the user. Here's what that plan looks like (btw, yes, we made our agents build HTML reports -- they're so much nicer to look at than a Linear/Slack wall-of-text comment). You can click on each test to see details.

By having the QA agent create a testing plan purely from the task description -- before any code exists -- we ensure the tests are about user behavior, not implementation details.

The testing plan gets posted to Linear so there's a record, and then the QA tags the Developer in Slack: "Hey, can you kick off the implementation?"
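As a rough mental model, a plan like this is just structured test cases derived from the ticket text alone. The field names below are illustrative, not the agents' actual schema:

```python
# Illustrative shape of a pre-implementation testing plan. It references
# only the ticket description -- never the code, which doesn't exist yet.
test_plan = {
    "ticket": "PZ-142",
    "derived_from": "task description only",
    "cases": [
        {"id": 1,
         "action": "open the profile settings page",
         "expect": "an avatar upload control is visible"},
        {"id": 2,
         "action": "upload an image over the size limit",
         "expect": "a validation error is shown and nothing is saved"},
    ],
}

# Every case is an action -> expectation pair about user behavior, so the
# later browser run can follow the plan step by step.
assert all({"action", "expect"} <= case.keys() for case in test_plan["cases"])
```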

Why multiple agents outperform a single one

Agents hyping each other after a few iterations of fixing an issue

Ok, this is the part I'm most excited about. Bear with me.

In theory, a single Claude Code session should be able to do everything: read the ticket, plan the implementation, write the code, test it, iterate until it works, and create a PR. Why wouldn't it? It has the context, it has the tools.

But from what we've observed, it just doesn't work that way. The biggest problem with LLMs in coding is that once they go down a path, they get stuck. The entire conversation history is in the context, and it becomes really hard for the model to step back and think from a completely different angle. It's like asking someone to critique their own essay while they're still writing it -- the train of thought is too strong.

This is actually why we have code reviews in real teams. Not because the original developer is bad, but because a fresh pair of eyes -- without the baggage of the implementation journey -- sees things differently.

So here's what we do for the implementation:

  1. Dual planning -- We spin up both Claude Code and Codex separately to create implementation plans. Neither knows about the other's plan.
  2. Cross-review -- Each agent reviews the other's plan. Again, fresh context -- no prior train of thought.
  3. Synthesis -- A final agent looks at both plans, all the reviews, all the pros and cons, and creates the best plan from both perspectives.

The key insight: whenever a plan is created or reviewed, the agent doing the review doesn't have the previous agent's chain of thought in its context. It's looking at the output with completely fresh eyes.

You can think of it as asking for a second opinion from a doctor who doesn't know what the first doctor said -- they'll look at the symptoms independently rather than anchoring on the first diagnosis.
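The three steps can be sketched as one orchestration function. Here `ask_claude` and `ask_codex` are stand-ins for real model calls (stubbed so the flow is visible); the actual agents run full Claude Code and Codex sessions:

```python
# Stubs standing in for fresh-context model sessions -- each call is a new
# conversation, so no chain of thought leaks between steps.
def ask_claude(prompt): return f"[claude] {prompt}"
def ask_codex(prompt): return f"[codex] {prompt}"

def plan_ticket(ticket):
    # 1. Dual planning: independent plans; neither model sees the other's.
    plan_a = ask_claude(f"plan implementation for: {ticket}")
    plan_b = ask_codex(f"plan implementation for: {ticket}")
    # 2. Cross-review: each model critiques the *other's* plan cold.
    review_of_a = ask_codex(f"review this plan: {plan_a}")
    review_of_b = ask_claude(f"review this plan: {plan_b}")
    # 3. Synthesis: one final fresh session merges plans and reviews.
    return ask_claude(
        f"synthesize the best plan from: {plan_a}; {plan_b}; "
        f"{review_of_a}; {review_of_b}"
    )

final = plan_ticket("PZ-142")
```

The important property isn't the prompts -- it's that every call starts a fresh session, so each reviewer sees only the artifact, never the reasoning that produced it.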

Implementation and PR

Diagram of the entire implementation flow

Once the final plan is ready, the actual coding starts. A Claude Code agent takes over with access to Playwright and the Figma MCP. With Playwright it can interact with the running app to verify things work. With Figma MCP it can reference the original designs.

The agent implements the feature, runs the tests, iterates, and when it's done -- it creates a pull request (full PR screenshot).

That PR then goes to a separate Codex agent for code review. This is another fresh-context moment: the reviewing agent has never seen the implementation conversation. It only sees the PR diff and the original task description.

They go back and forth -- the reviewer requests changes, the original Claude Code agent addresses them or pushes back with reasoning. Just like a real code review in Slack. After a few rounds, the PR is approved.

QA tests in the browser

Now comes the final verification. The Developer tags the QA agent in Slack: "Hey, can you test this feature based on the initial plan?"

Developer calling the QA to start testing

The QA agent works entirely in the browser. Remember those ports we set up at the beginning? Now they matter. The QA agent navigates to the running instance, clicks through the features, and tests everything from the user's perspective based on the plan it created before any code was written.

If something fails, it goes back and forth with the Developer in Slack -- "this button doesn't work" or "the redirect goes to the wrong page." They iterate until all tests pass.
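That iterate-until-green loop is simple to picture in code. `run_browser_tests` and `post_to_developer` are hypothetical stand-ins for the QA agent's browser run and its Slack message back to the Developer:

```python
def qa_loop(plan, run_browser_tests, post_to_developer, max_rounds=10):
    """Re-run the original testing plan until every case passes.

    run_browser_tests(plan) -> list of failure descriptions (empty = green)
    post_to_developer(failures) -> hands the failures back over Slack
    """
    for round_no in range(1, max_rounds + 1):
        failures = run_browser_tests(plan)
        if not failures:
            return round_no  # every case in the plan passes
        post_to_developer(failures)
    raise RuntimeError(f"still failing after {max_rounds} rounds")

# Toy run: round 1 finds a broken button, round 2 comes back clean.
results = [["'Save' button does nothing"], []]
rounds = qa_loop(
    plan={"ticket": "PZ-142"},
    run_browser_tests=lambda plan: results.pop(0),
    post_to_developer=lambda failures: None,
)
print(rounds)  # -> 2
```

The cap on rounds matters: without it, a genuinely ambiguous requirement would have the two agents ping-ponging forever instead of escalating to a human.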

Here's what the final test report looks like (after 9 iterations between the Developer and the QA).

Human review

Once all the automated coordination is done, the ticket gets moved to "Review" in Linear and assigned to a human. At this point, all the heavy lifting is done -- the code is written, reviewed, tested, and the PR is ready.

Final Linear comment by the QA agent

The human reviewer can click the link to the running instance (on its isolated port), log in, and test everything manually. No setup needed, no branch checkout, no "it works on my machine." Just click and verify.

The Tech Lead as coordinator

Throughout this entire process, the Tech Lead agent acts as the coordinator. It doesn't write code or test features -- it orchestrates. It tracks the status of each step, handles the handoffs between agents, and makes sure nothing falls through the cracks.

Tech Lead ensuring that the final QA report has all screenshots

Think of it as a project manager who doesn't code but makes sure the right person is working on the right thing at the right time.

Why Slack

One thing I didn't expect was how much Slack changes the experience once you move beyond thinking of AI agents as coding tools.

Observability

When you have multiple agents working on a feature, you need to know what they're doing. Most people think you need some fancy dashboard or monitoring tool for that. But actually, the only thing you really need is the conversation between them. Just like with human coworkers -- you open the Slack thread and you can see exactly what happened. Who did what, what decisions were made, what questions came up. The whole history is right there, in a format you already know how to read.

Interaction

The second thing is even more important: it feels natural. When an agent has a question, it literally tags you in Slack. In the middle of the whole implementation, you can jump in. If they're heading in the wrong direction, you redirect them. If they have an ambiguous requirement, you clarify it. You're not switching to some separate tool or reviewing logs -- you're just replying in a thread, the way you would with any teammate.

Developer agent asks a human for a product decision

I gotta say, this is the part that surprised me the most. After a while, it stops feeling like you're using a tool. It starts feeling like you're working with colleagues who happen to be really fast and never take lunch breaks. They tag you when they need you, they update you on progress, and the rest of the time they just get the work done.

What I learned

A few things that surprised me:

  • Fresh context is everything. The single biggest improvement came from making sure reviewing agents don't have the implementing agent's conversation history. It sounds simple, but it makes a massive difference in the quality of reviews and plans.

  • Testing plans before code matters. When the QA creates tests after seeing the code, the tests are biased toward the implementation. When it creates tests from just the task description, the tests actually catch real issues.

  • Parallel environments are non-negotiable. Without isolated worktrees and ports, the whole thing falls apart the moment you try to work on two tickets at once.

  • Humans are still essential. This system doesn't replace developers -- it handles the routine implementation work so humans can focus on architecture decisions, product direction, and the weird edge cases that AI still struggles with.

  • Ticket descriptions are more important than ever. When I write vague instructions, the agents do something that technically works but doesn't match what I actually had in mind. When I spend five minutes describing exactly how I want a feature to look and behave -- they nail it. The irony is that the system is fast enough now that the limiting factor isn't AI capability, it's how well I communicate what I want. I personally use Wispr Flow -- I just hit record on my computer and talk for five minutes about how the feature should work, in as much detail as I can. It transcribes everything, I paste it into the ticket, and the results are dramatically better.

Try it yourself

I've open-sourced all the skills that make this work. You can import them into OpenClaw or into Pazi and set up the same workflow for your team.

You can find all the skills here, with instructions for setting them up on OpenClaw -- or just go to Pazi, click the tech team template, and you'll be good to go.

If you try it, I'd genuinely love to hear how it goes -- what works, what breaks, what you'd do differently. Drop a comment or ping me directly.

This is still early. We're iterating on this every day, and there are definitely rough edges. But I gotta say, watching three AI agents coordinate a ticket from start to PR in Slack is something I didn't think I'd see this soon.
