DEV Community

Nazar Boyko

Originally published at nazarboyko.com

Creating Specialized AI Agents: Developer, Tester, Reviewer, Documenter

One universal AI agent sounds convenient. One agent to read tickets, write code, generate tests, review security, update docs, and create pull requests, all from a single prompt.

Nice idea.

But in real engineering work, one giant agent often performs worse than several small agents with clear jobs. The reason is that software work has different modes of thinking. Writing code is not the same as testing code. Reviewing code is not the same as documenting code, and security analysis is not the same as release-note writing. A good AI architecture respects those differences instead of flattening them into one prompt.

The problem with the universal agent

A universal agent usually receives a prompt like this:

You are a senior engineer. Implement the task, write tests, review the code,
update documentation, and create a PR.

This may work for small demos, but it creates problems in real projects:

- The agent mixes responsibilities.
- It may review its own assumptions too gently.
- It may skip tests to finish faster.
- It may edit too many files.
- It may forget documentation.
- It may produce a confident summary without enough evidence.

Humans have the same issue, which is exactly why teams separate implementation, review, QA, security, documentation, and release into different processes. Specialized agents follow the same idea.

Figure: one overloaded giant agent juggling many tools (left) versus five small specialized agents passing clean workflow cards (right).

Specialized agents are easier to control

A specialized agent has a narrow job, and a narrow job is much easier to define:

- what it can read
- what it can edit
- what commands it can run
- what output it must produce
- when it must stop

For example, a Tester Agent can be allowed to edit only test files:

agent: tester
can_read:
  - app/**
  - src/**
  - tests/**
can_edit:
  - tests/**
can_run:
  - "php artisan test *"
  - "npm test *"
never:
  - "edit production code"
  - "change public API behavior"

That is much safer than handing a general-purpose agent the keys to everything.
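One way a policy like this could be enforced is with a small check in front of every tool call. The sketch below is a hypothetical enforcement layer, not part of any agent framework: `TESTER_POLICY`, `may_edit`, and `may_run` are names invented for illustration, and it assumes the patterns are shell-style globs matched with Python's `fnmatch`, where `*` also matches `/`.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory copy of the tester policy shown above.
TESTER_POLICY = {
    "can_read": ["app/**", "src/**", "tests/**"],
    "can_edit": ["tests/**"],
    "can_run": ["php artisan test *", "npm test *"],
}


def _matches_any(value: str, patterns: list[str]) -> bool:
    """Return True if value matches at least one glob pattern."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def may_edit(policy: dict, path: str) -> bool:
    """Allow an edit only when the path matches a can_edit pattern."""
    return _matches_any(path, policy["can_edit"])


def may_run(policy: dict, command: str) -> bool:
    """Allow a command only when it matches a can_run pattern."""
    return _matches_any(command, policy["can_run"])


if __name__ == "__main__":
    print(may_edit(TESTER_POLICY, "tests/Feature/WeeklyReminderEligibilityTest.php"))  # True
    print(may_edit(TESTER_POLICY, "app/Services/WeeklyReminderEligibility.php"))  # False
    print(may_run(TESTER_POLICY, "php artisan migrate"))  # False
```

Globbing a raw command string is a weak gate, since anything appended after `php artisan test ` would still match. A stricter version would probably parse the command into arguments before deciding.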

Developer Agent

The Developer Agent implements scoped changes. It should not be responsible for everything; its main job is to make the smallest safe code change that satisfies clear acceptance criteria.

Good input:

## Task
Prevent duplicate weekly reminders after form submission.

## Acceptance Criteria
- If a weekly form was submitted this week, do not send a reminder.
- Existing reminder interval rules must still apply otherwise.

## Allowed Files
- app/Services/WeeklyReminderEligibility.php
- tests/Feature/WeeklyReminderEligibilityTest.php

Expected behavior:

- read related files
- explain current behavior briefly
- modify only allowed files
- keep change small
- report exact diff summary

Example output:

## Developer Agent Report

Changed:
- Added submitted-form guard before interval check.
- Kept existing interval calculation unchanged.

Files changed:
- app/Services/WeeklyReminderEligibility.php

Needs verification:
- Week boundary behavior should be tested for the configured timezone.

Notice what is missing. The Developer Agent does not mark the whole task complete; it reports what it changed and what still needs checking, and then it stops.
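The "modify only allowed files" rule is easy to check mechanically. Here is a minimal sketch, assuming the allowed list from the task and a list of changed paths are both available as plain strings. The function name `out_of_scope_files` and the extra path `config/reminders.php` are hypothetical and only there to show a violation.

```python
def out_of_scope_files(allowed_files: list[str], changed_files: list[str]) -> list[str]:
    """Return changed files that are not on the task's allowed list, in order."""
    allowed = set(allowed_files)
    return [path for path in changed_files if path not in allowed]


# Allowed files copied from the task above.
ALLOWED = [
    "app/Services/WeeklyReminderEligibility.php",
    "tests/Feature/WeeklyReminderEligibilityTest.php",
]

if __name__ == "__main__":
    # The report above touches one allowed file, so nothing is flagged.
    print(out_of_scope_files(ALLOWED, ["app/Services/WeeklyReminderEligibility.php"]))  # []

    # A hypothetical extra edit outside the scope is reported back.
    print(out_of_scope_files(ALLOWED, [
        "app/Services/WeeklyReminderEligibility.php",
        "config/reminders.php",
    ]))  # ['config/reminders.php']
```

The changed list is more trustworthy when it comes from the version control diff rather than from the agent's own report.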

Tester Agent

The Tester Agent tries to break assumptions. Its job is not to make the implementation look good; its job is to prove behavior.

Good tasks:

- Add regression test for a reported bug.
- Add edge-case tests for date boundaries.
- Run focused test suite.
- Report missing coverage.

Example Laravel tests:

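<?php

namespace Tests\Feature;

// Model and notification namespaces follow Laravel defaults and are
// assumptions here; adjust them to match your application.
use App\Models\User;
use App\Models\WeeklyForm;
use App\Notifications\WeeklyReminder;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Illuminate\Support\Facades\Notification;
use Tests\TestCase;

class WeeklyReminderEligibilityTest extends TestCase
{
    // Reset the database between tests so factories start from a clean state.
    use RefreshDatabase;

    public function test_user_with_submitted_form_does_not_receive_reminder(): void
    {
        // Capture notifications instead of delivering them.
        Notification::fake();

        $user = User::factory()->create();

        // A form submitted now falls inside the current week.
        WeeklyForm::factory()->for($user)->create([
            'submitted_at' => now(),
        ]);

        $this->artisan('app:send-weekly-reminders')
            ->assertSuccessful();

        Notification::assertNotSentTo($user, WeeklyReminder::class);
    }

    public function test_user_without_submitted_form_can_receive_reminder(): void
    {
        Notification::fake();

        // No WeeklyForm exists for this user, so the guard must not block them.
        $user = User::factory()->create();

        $this->artisan('app:send-weekly-reminders')
            ->assertSuccessful();

        Notification::assertSentTo($user, WeeklyReminder::class);
    }
}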

A strong Tester Agent reports not only what passed, but also what it could not verify:

## Tester Agent Report

Added tests:
- submitted users do not receive reminders
- users without submitted forms still receive reminders

Ran:
- php artisan test --filter=WeeklyReminderEligibilityTest ✅

Not covered:
- timezone-specific week boundaries
- users with multiple submitted forms

That "not covered" section is the valuable part, because it tells you exactly where the implementation is still unproven.
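One way to keep that section from being dropped is to make it a field of the report rather than free text. The sketch below is an assumption about how such a report could be modeled: `TesterReport` and its `verdict` labels are invented for illustration and do not come from any framework.

```python
from dataclasses import dataclass, field


@dataclass
class TesterReport:
    """Hypothetical structured form of the Tester Agent report."""

    added_tests: list[str]
    commands_run: list[str]
    not_covered: list[str] = field(default_factory=list)

    def verdict(self) -> str:
        """Summarize the run without hiding unproven behavior."""
        if not self.commands_run:
            return "unverified"
        return "verified with gaps" if self.not_covered else "verified"


if __name__ == "__main__":
    report = TesterReport(
        added_tests=[
            "submitted users do not receive reminders",
            "users without submitted forms still receive reminders",
        ],
        commands_run=["php artisan test --filter=WeeklyReminderEligibilityTest"],
        not_covered=[
            "timezone-specific week boundaries",
            "users with multiple submitted forms",
        ],
    )
    print(report.verdict())  # verified with gaps
```

With this shape, a downstream agent or a human can see at a glance that the tests passed and that two behaviors are still open.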

Reviewer Agent

The Reviewer Agent reads the diff like a code reviewer, and a good reviewer does not simply praise the work. It should check:

- Is the change minimal?
- Are names clear?
- Is behavior hidden in the wrong layer?
- Are tests meaningful?
- Is there duplicated logic?
- Could this break existing callers?

Example review output:

## Reviewer Agent Findings

### Concern: Week boundary depends on server timezone
The implementation uses `now()->startOfWeek()` without checking the app's configured user timezone.

Recommendation:
Use the same timezone source used by the reminder scheduler, or add a test proving this behavior.

### Positive
The change is small and keeps existing interval logic unchanged.

A Reviewer Agent is useful precisely because it creates friction, and good engineering needs friction in the right places.

Figure: a code review desk, with a magnifying glass over a pull request diff beside a checklist for Scope, Tests, Risk, and Naming.

Security Agent

The Security Agent focuses on risk, and it should be skeptical by default.

Checklist:

- authorization checks
- authentication bypass
- SQL injection
- unsafe shell execution
- secret exposure
- sensitive data logging
- insecure redirects
- dependency risk
- excessive permissions

Example prompt:

Review this diff for security risks. Do not edit files.
Return findings with severity, file, reason, and recommendation.

Example output:

## Security Agent Report

### Medium: Missing authorization check
File: app/Http/Controllers/InvoiceController.php

The new endpoint returns invoice data but does not call a policy or permission check.

Recommendation:
Add `$this->authorize('view', $invoice)` and a feature test for unauthorized access.

### Low: Log may expose customer email
File: app/Services/BillingService.php

The error log includes the full request payload, which may contain the customer's email address.

Recommendation:
Log only the invoice ID and gateway error code.

The Security Agent should usually be read-only. Security patches should go through a Developer Agent or a human, so the agent that finds a risk is not the same one that quietly rewrites the code around it.
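Because the prompt asks for severity, file, reason, and recommendation, the findings can be treated as records and used as a merge gate. The following is a minimal sketch under assumptions: the `Finding` class, the `SEVERITY_RANK` ordering, and the default `medium` threshold are all choices made for this example, not an established standard.

```python
from dataclasses import dataclass

# Severity ranking is an assumption of this sketch; higher means more urgent.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass(frozen=True)
class Finding:
    severity: str
    file: str
    reason: str
    recommendation: str


def blocking_findings(findings: list[Finding], threshold: str = "medium") -> list[Finding]:
    """Return findings at or above the threshold, most severe first."""
    minimum = SEVERITY_RANK[threshold]
    blocking = [f for f in findings if SEVERITY_RANK[f.severity.lower()] >= minimum]
    return sorted(blocking, key=lambda f: SEVERITY_RANK[f.severity.lower()], reverse=True)


if __name__ == "__main__":
    # The two findings from the report above.
    report = [
        Finding("Low", "app/Services/BillingService.php",
                "Log may expose customer email",
                "Log only the invoice ID and gateway error code."),
        Finding("Medium", "app/Http/Controllers/InvoiceController.php",
                "Missing authorization check",
                "Add a policy check and a feature test for unauthorized access."),
    ]
    for finding in blocking_findings(report):
        print(finding.severity, finding.file)
    # Medium app/Http/Controllers/InvoiceController.php
```

The gate only reads findings and never edits files, which keeps the Security Agent in its read-only role.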

Documentation Agent

The Documentation Agent turns implementation details into human-readable guidance. It can update:

- README
- docs folder
- API examples
- changelog
- migration notes
- release notes

Example input:

Behavior changed:
Weekly reminders are skipped when a user submitted the current weekly form.

Files changed:
- app/Services/WeeklyReminderEligibility.php
- tests/Feature/WeeklyReminderEligibilityTest.php

Example documentation update:

### Weekly Reminder Eligibility

A user is not eligible for a weekly reminder if they already submitted the
weekly check-in form for the current week.

If no form was submitted, the existing reminder interval rules still apply.

This is one of the highest-value specialized agents, because documentation is the first thing that gets forgotten when engineers are busy.

Orchestrator Agent

The Orchestrator Agent coordinates the others, and the key rule is that it should not do all the work itself. Its job is:

- split the task
- assign agents
- pass context
- enforce order
- check required outputs
- stop at approval gates
- combine final report

Example workflow:

Orchestrator
  ↓
Analysis Agent: find relevant files
  ↓
Tester Agent: create failing test
  ↓
Developer Agent: implement change
  ↓
Tester Agent: run checks
  ↓
Reviewer Agent: review diff
  ↓
Security Agent: review risks
  ↓
Documentation Agent: update docs
  ↓
Orchestrator: final PR summary

The orchestrator creates structure; the specialized agents create focused output.
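The "enforce order, check required outputs, stop at approval gates" part of that job can be sketched as a plain loop. This is an illustrative skeleton under assumptions, not a real framework: `Step`, `run_pipeline`, and the status strings are hypothetical names, and each agent is reduced to a function that takes the accumulated context and returns a dict.

```python
from dataclasses import dataclass
from typing import Callable

# Each step receives the accumulated context and returns its own output dict.
StepFn = Callable[[dict], dict]


@dataclass
class Step:
    name: str
    run: StepFn
    required_outputs: tuple[str, ...] = ()
    needs_approval: bool = False


def run_pipeline(steps: list[Step], context: dict,
                 approve: Callable[[str, dict], bool]) -> dict:
    """Run steps in order; stop on a missing output or a declined approval."""
    for step in steps:
        output = step.run(context)
        missing = [key for key in step.required_outputs if key not in output]
        if missing:
            # Reject the handoff instead of passing incomplete work along.
            return {"status": "rejected", "stopped_at": step.name,
                    "missing": missing, "context": context}
        context = {**context, step.name: output}
        if step.needs_approval and not approve(step.name, output):
            return {"status": "awaiting_approval", "stopped_at": step.name,
                    "context": context}
    return {"status": "complete", "context": context}


if __name__ == "__main__":
    # Stub agents stand in for real model calls.
    steps = [
        Step("analysis",
             lambda ctx: {"relatedFiles": ["app/Services/WeeklyReminderEligibility.php"]},
             required_outputs=("relatedFiles",)),
        Step("developer",
             lambda ctx: {"summary": "Added submitted-form guard"},
             required_outputs=("summary", "files_changed")),
    ]
    result = run_pipeline(steps, {}, approve=lambda name, output: True)
    print(result["status"], result["stopped_at"], result["missing"])
    # rejected developer ['files_changed']
```

Note that the loop itself does no engineering work. It only routes, checks, and stops.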

Figure: an orchestration wheel, with the Orchestrator at the center and six labeled spokes for Developer (Implement), Tester (Verify), Reviewer (Review), Security (Protect), Documentation (Explain), and Analysis (Discover).

How agents hand off work

Handoffs should be structured. Do not pass a vague paragraph when a typed artifact would work better.

Example handoff from Analysis Agent to Developer Agent:

{
  "task": "Prevent duplicate weekly reminders after form submission",
  "relatedFiles": [
    "app/Console/Commands/SendWeeklyReminders.php",
    "app/Services/WeeklyReminderEligibility.php",
    "tests/Feature/WeeklyReminderEligibilityTest.php"
  ],
  "currentBehavior": "Reminder eligibility checks interval but not submitted weekly forms.",
  "recommendedChange": "Add submitted-form guard before interval logic.",
  "risks": [
    "week boundary timezone behavior"
  ]
}

A typed artifact like this is easier for the next agent to use and easier for humans to inspect.
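Part of what makes the artifact "typed" is that the receiving side can reject a malformed one before any work starts. Here is a minimal sketch using only the standard library. The field names mirror the example handoff, while `HANDOFF_FIELDS` and `validate_handoff` are hypothetical names for this illustration.

```python
import json

# Field names and types mirror the example handoff; the validator is hypothetical.
HANDOFF_FIELDS = {
    "task": str,
    "relatedFiles": list,
    "currentBehavior": str,
    "recommendedChange": str,
    "risks": list,
}


def validate_handoff(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the handoff is well-formed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["handoff must be a JSON object"]

    problems = []
    for name, expected in HANDOFF_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"{name} must be {expected.__name__}")
    return problems


if __name__ == "__main__":
    incomplete = '{"task": "Prevent duplicate weekly reminders", "risks": "timezones"}'
    for problem in validate_handoff(incomplete):
        print(problem)
```

A schema library would do this more thoroughly, but even a check this small turns a vague handoff into an explicit rejection with reasons.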

A practical agent team configuration

Here is a simple configuration example:

agents:
  analysis:
    role: "Find relevant code and explain current behavior"
    edit: false

  developer:
    role: "Implement scoped code changes"
    edit: true
    requires_approval_for:
      - "production code"
      - "dependencies"

  tester:
    role: "Create and run tests"
    edit_paths:
      - "tests/**"

  reviewer:
    role: "Review code quality and maintainability"
    edit: false

  security:
    role: "Review security and privacy risks"
    edit: false

  documentation:
    role: "Update docs and changelog"
    edit_paths:
      - "README.md"
      - "docs/**"
      - "CHANGELOG.md"

This setup is not complicated, and that is the point. You can start small.
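The configuration mixes two forms, a boolean `edit` and a list of `edit_paths`, so whatever reads it has to decide how they combine. The sketch below shows one possible resolution and is an assumption, not a defined behavior of the config: `edit_paths` wins when present, a missing key means no write access, and patterns are matched with Python's `fnmatch`. The `AGENTS` dict is a trimmed, hand-written copy of the YAML.

```python
from fnmatch import fnmatchcase

# Trimmed, hypothetical in-memory copy of the configuration above.
AGENTS = {
    "analysis": {"edit": False},
    "developer": {"edit": True,
                  "requires_approval_for": ["production code", "dependencies"]},
    "tester": {"edit_paths": ["tests/**"]},
    "documentation": {"edit_paths": ["README.md", "docs/**", "CHANGELOG.md"]},
}


def can_edit(agents: dict, agent: str, path: str) -> bool:
    """Resolve the edit / edit_paths keys into a yes-or-no answer for one path."""
    config = agents.get(agent)
    if config is None:
        return False  # Unknown agents get no write access.
    if "edit_paths" in config:
        return any(fnmatchcase(path, pattern) for pattern in config["edit_paths"])
    return bool(config.get("edit", False))


if __name__ == "__main__":
    print(can_edit(AGENTS, "tester", "tests/Feature/WeeklyReminderEligibilityTest.php"))  # True
    print(can_edit(AGENTS, "tester", "app/Services/WeeklyReminderEligibility.php"))  # False
    print(can_edit(AGENTS, "documentation", "docs/reminders.md"))  # True
    print(can_edit(AGENTS, "analysis", "README.md"))  # False
```

The `requires_approval_for` entries are free-text labels, so they cannot be checked this way. They would need to be mapped to paths or handled at a human approval gate.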

Final thought

Specialized agents are not about making AI architecture fancy. They are about making AI work easier to control.

A Developer Agent implements. A Tester Agent verifies. A Reviewer Agent challenges. A Security Agent protects. A Documentation Agent explains. An Orchestrator Agent coordinates. That structure mirrors how real engineering teams already work, and that is exactly why it works.

One giant agent may look impressive in a demo. A small team of focused agents is the thing that holds up in production.


Top comments (5)

nexus-lab-zen

The role split is clean, and having the Tester report "what it could not verify" is the part most setups skip. The edge I keep hitting: the Orchestrator routes on each agent's self-report, so a handoff carries the output but not an independent check that the done-claim is real. A Tester saying "these paths are uncovered" is itself a claim — in many setups, nothing downstream distinguishes an honest "I couldn't cover this" from a convenient one, and the Reviewer often inherits that gap rather than closes it.

What seems to work better than trusting the report is making each handoff carry evidence the next agent (or a human) can re-check on its own — the diff plus the artifact that proves the claim, not the agent's narration of it. Curious how you're thinking about the Orchestrator verifying coverage claims vs. trusting them, especially when the Tester's "could not verify" set is self-declared.

Nazar Boyko

This is the sharpest critique of the whole pattern, and you're right, routing on self-reports just relocates the trust problem instead of solving it. A Tester's "couldn't cover X" is exactly as gameable as a Developer's "done."
The way I'm leaning now is to make the Orchestrator never accept a claim that isn't accompanied by a re-runnable artifact. Coverage isn't "the Tester said so", it's the actual coverage report (or a diff of which lines/branches the new tests touch) attached to the handoff, so the Reviewer or a human can re-execute it independently. If the artifact is missing, the handoff is rejected at the gate, not trusted-then-reviewed.
That shrinks the Tester's job from "judge coverage" to "produce evidence of coverage," which is a much harder thing to fake convincingly. The honest-vs-convenient "could not verify" still exists, but now it's checkable: a self-declared gap that the coverage data contradicts becomes a Reviewer finding rather than something inherited silently.
Where it gets genuinely hard, and I don't think I've solved it, is non-executable claims such as "this is safe across timezones" or "no behavioral regression for callers." There's no cheap artifact for those, so they still fall back to a human gate. Curious whether you've found a way to make those re-checkable, or whether you just treat them as always-escalate.

nexus-lab-zen

That last bucket is the one I've sunk the most time into, and I don't have a clean win either — but the move that helped was to stop trying to make the claim executable and instead force it to carry its own scope boundary as the artifact.

So "safe across timezones" never ships as a judgment. It ships as "exercised under TZ=UTC, America/Sao_Paulo, Asia/Kolkata — outputs attached." Now it's re-runnable, and more importantly it's honest about being bounded: anything outside that enumerated set is explicitly not covered rather than implied-safe. The universal shrinks to the finite set actually touched.

"No behavioral regression for callers" gets the same treatment, but the artifact is a scope declaration instead of outputs: "checked the callers grep finds for X; anything added after this commit, or reached via reflection/dynamic dispatch, is outside this check." That turns an unfalsifiable universal into a checkable, re-greppable set — and the residual that genuinely can't be enumerated becomes a small, named thing.

That's what made it non-binary for me: not always-escalate vs trust, but shrink the escalation surface down to the irreducibly-human residual and make the agent name that edge out loud. The human gate then only sees the part that's actually un-automatable, not the whole claim — same win as your coverage case: a self-declared boundary the data can later contradict, instead of a silent inheritance.

Where I'd push it back to you: could the Orchestrator treat "scope declaration present and well-formed" as itself a gateable artifact — reject the handoff if the claim doesn't name its own boundary — even when the underlying claim isn't executable?

Muro

Great article! I like the focus on specialized AI agents instead of expecting one assistant to do everything. Breaking responsibilities into clear roles makes the workflow more practical, easier to review, and much closer to how real engineering teams collaborate. Nicely explained with useful examples.

Nazar Boyko

Thanks Muro! Yeah, that mirror-to-real-teams angle was the thing I kept coming back to — we already split implementation, QA, review, and security for a reason, and pretending one agent can hold all those modes at once just reintroduces the problem we solved years ago. Appreciate you reading 🙌