Nazar Boyko

Posted on Jun 25 • Originally published at nazarboyko.com

Creating Specialized AI Agents: Developer, Tester, Reviewer, Documenter

#ai #architecture #agents #specialized

One universal AI agent sounds convenient. One agent to read tickets, write code, generate tests, review security, update docs, and create pull requests, all from a single prompt.

Nice idea.

But in real engineering work, one giant agent often performs worse than several small agents with clear jobs. The reason is that software work involves different modes of thinking. Writing code is not the same as testing code. Reviewing code is not the same as documenting code, and security analysis is not the same as release-note writing. A good AI architecture respects those differences instead of flattening them into one prompt.

The problem with the universal agent

A universal agent usually receives a prompt like this:

You are a senior engineer. Implement the task, write tests, review the code,
update documentation, and create a PR.

This may work for small demos, but it creates problems in real projects:

- The agent mixes responsibilities.
- It may review its own assumptions too gently.
- It may skip tests to finish faster.
- It may edit too many files.
- It may forget documentation.
- It may produce a confident summary without enough evidence.

Humans have the same issue, which is exactly why teams separate implementation, review, QA, security, documentation, and release into different processes. Specialized agents follow the same idea.

Specialized agents are easier to control

A specialized agent has a narrow job, and a narrow job is much easier to define:

- what it can read
- what it can edit
- what commands it can run
- what output it must produce
- when it must stop

For example, a Tester Agent can be allowed to edit only test files:

agent: tester
can_read:
  - app/**
  - src/**
  - tests/**
can_edit:
  - tests/**
can_run:
  - "php artisan test *"
  - "npm test *"
never:
  - "edit production code"
  - "change public API behavior"

That is much safer than handing a general-purpose agent the keys to everything.
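To make the idea concrete, here is a minimal sketch of how a wrapper around the agent could check the tester policy before an edit or a command goes through. The `TESTER_POLICY` dict and the `can_edit` / `can_run` helpers are hypothetical names for illustration, not part of any specific agent framework, and the matching uses Python's standard `fnmatch` globbing rather than whatever your tooling actually implements.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory copy of the tester policy shown above.
TESTER_POLICY = {
    "can_edit": ["tests/**"],
    "can_run": ["php artisan test *", "npm test *"],
}


def is_allowed(value: str, patterns: list[str]) -> bool:
    """Return True if value matches at least one glob pattern."""
    # fnmatchcase lets "*" match "/" as well, so "tests/**" covers nested paths.
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def can_edit(path: str) -> bool:
    """Check a file path against the tester's edit allowlist."""
    return is_allowed(path, TESTER_POLICY["can_edit"])


def can_run(command: str) -> bool:
    """Check a command line against the tester's command allowlist."""
    return is_allowed(command, TESTER_POLICY["can_run"])


print(can_edit("tests/Feature/WeeklyReminderEligibilityTest.php"))
print(can_edit("app/Services/WeeklyReminderEligibility.php"))
print(can_run("php artisan test --filter=WeeklyReminderEligibilityTest"))
print(can_run("php artisan migrate"))
```

A plain string glob cannot tell a test filter from a chained shell command, so a check like this is only a first line of defense. Real enforcement belongs in the agent runtime or sandbox.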

Developer Agent

The Developer Agent implements scoped changes. It should not be responsible for everything; its main job is to make the smallest safe code change that satisfies clear acceptance criteria.

Good input:

## Task
Prevent duplicate weekly reminders after form submission.

## Acceptance Criteria
- If a weekly form was submitted this week, do not send a reminder.
- Existing reminder interval rules must still apply otherwise.

## Allowed Files
- app/Services/WeeklyReminderEligibility.php
- tests/Feature/WeeklyReminderEligibilityTest.php

Expected behavior:

- read related files
- explain current behavior briefly
- modify only allowed files
- keep change small
- report exact diff summary

Example output:

## Developer Agent Report

Changed:
- Added submitted-form guard before interval check.
- Kept existing interval calculation unchanged.

Files changed:
- app/Services/WeeklyReminderEligibility.php

Needs verification:
- Week boundary behavior should be tested for configured timezone.

Notice what is missing. The Developer Agent does not mark the whole task complete; it reports what it changed and what still needs checking, and then it stops.
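The "modify only allowed files" rule is easy to check mechanically. Below is a rough sketch, assuming the task brief keeps the `## Allowed Files` heading used in the example input and that the list of changed files comes from somewhere like `git diff --name-only`. The function names are made up for illustration.

```python
def allowed_files(task_markdown: str) -> set[str]:
    """Collect the bullet items listed under the '## Allowed Files' heading."""
    allowed: set[str] = set()
    in_section = False
    for line in task_markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Any new heading either opens or closes the section.
            in_section = stripped == "## Allowed Files"
        elif in_section and stripped.startswith("- "):
            allowed.add(stripped[2:].strip())
    return allowed


def out_of_scope(changed_files: list[str], task_markdown: str) -> list[str]:
    """Return changed files that the task brief did not allow, sorted."""
    return sorted(set(changed_files) - allowed_files(task_markdown))


TASK = """\
## Task
Prevent duplicate weekly reminders after form submission.

## Allowed Files
- app/Services/WeeklyReminderEligibility.php
- tests/Feature/WeeklyReminderEligibilityTest.php
"""

print(out_of_scope(
    ["app/Services/WeeklyReminderEligibility.php", "composer.json"],
    TASK,
))
```

If the returned list is not empty, the report can be rejected before anyone reads the diff.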

Tester Agent

The Tester Agent tries to break assumptions. Its job is not to make the implementation look good; its job is to prove behavior.

Good tasks:

- Add regression test for a reported bug.
- Add edge-case tests for date boundaries.
- Run focused test suite.
- Report missing coverage.

Example Laravel tests:

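<?php

namespace Tests\Feature;

// The App\... namespaces below assume Laravel's default layout; adjust them to your app.
use App\Models\User;
use App\Models\WeeklyForm;
use App\Notifications\WeeklyReminder;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Illuminate\Support\Facades\Notification;
use Tests\TestCase;

class WeeklyReminderEligibilityTest extends TestCase
{
    use RefreshDatabase; // factories write to the database, so reset it between tests

    public function test_user_with_submitted_form_does_not_receive_reminder(): void
    {
        // Capture notifications instead of delivering them.
        Notification::fake();

        $user = User::factory()->create();

        // The user has already submitted this week's form.
        WeeklyForm::factory()->for($user)->create([
            'submitted_at' => now(),
        ]);

        $this->artisan('app:send-weekly-reminders')
            ->assertSuccessful();

        Notification::assertNotSentTo($user, WeeklyReminder::class);
    }

    public function test_user_without_submitted_form_can_receive_reminder(): void
    {
        Notification::fake();

        $user = User::factory()->create();

        $this->artisan('app:send-weekly-reminders')
            ->assertSuccessful();

        Notification::assertSentTo($user, WeeklyReminder::class);
    }
}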

A strong Tester Agent reports not only what passed, but also what it could not verify:

## Tester Agent Report

Added tests:
- submitted users do not receive reminders
- users without submitted forms still receive reminders

Ran:
- php artisan test --filter=WeeklyReminderEligibilityTest ✅

Not covered:
- timezone-specific week boundaries
- users with multiple submitted forms

That "not covered" section is the valuable part, because it tells you exactly where the implementation is still unproven.

Reviewer Agent

The Reviewer Agent reads the diff like a code reviewer, and a good reviewer does not simply praise the work. It should check:

- Is the change minimal?
- Are names clear?
- Is behavior hidden in the wrong layer?
- Are tests meaningful?
- Is there duplicated logic?
- Could this break existing callers?

Example review output:

## Reviewer Agent Findings

### Concern: Week boundary depends on server timezone
The implementation uses `now()->startOfWeek()` without checking the app's configured user timezone.

Recommendation:
Use the same timezone source used by the reminder scheduler, or add a test proving this behavior.

### Positive
The change is small and keeps existing interval logic unchanged.

A Reviewer Agent is useful precisely because it creates friction, and good engineering needs friction in the right places.
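The week-boundary concern in the example review is worth seeing with real numbers. The sketch below is plain Python, not the Laravel code under review, and it assumes weeks start on Monday. It uses a fixed UTC+05:30 offset as a stand-in for a user timezone so that it runs without a timezone database.

```python
from datetime import datetime, timedelta, timezone


def start_of_week(instant: datetime, tz: timezone) -> datetime:
    """Monday 00:00 of the week containing `instant`, as seen in `tz`."""
    local = instant.astimezone(tz)
    monday = local - timedelta(days=local.weekday())  # weekday() is 0 on Monday
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


UTC = timezone.utc
USER_TZ = timezone(timedelta(hours=5, minutes=30))  # assumed user offset, UTC+05:30

# Sunday 2024-06-30 23:30 UTC is already Monday 2024-07-01 05:00 at UTC+05:30.
submitted_at = datetime(2024, 6, 30, 23, 30, tzinfo=UTC)

week_utc = start_of_week(submitted_at, UTC)
week_user = start_of_week(submitted_at, USER_TZ)

print(week_utc.date())   # week as the server sees it
print(week_user.date())  # week as the user sees it
```

The same submission lands in two different weeks depending on which clock you trust, which is exactly the kind of gap a Reviewer Agent should push back on.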

Security Agent

The Security Agent focuses on risk, and it should be skeptical by default.

Checklist:

- authorization checks
- authentication bypass
- SQL injection
- unsafe shell execution
- secret exposure
- sensitive data logging
- insecure redirects
- dependency risk
- excessive permissions

Example prompt:

Review this diff for security risks. Do not edit files.
Return findings with severity, file, reason, and recommendation.

Example output:

## Security Agent Report

### Medium: Missing authorization check
File: app/Http/Controllers/InvoiceController.php

The new endpoint returns invoice data but does not call a policy or permission check.

Recommendation:
Add `$this->authorize('view', $invoice)` and a feature test for unauthorized access.

### Low: Log may expose customer email
File: app/Services/BillingService.php

The error log includes full request payload.

Recommendation:
Log only the invoice ID and gateway error code.

The Security Agent should usually be read-only. Security patches should go through a Developer Agent or a human, so the agent that finds a risk is not the same one that quietly rewrites the code around it.
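If the Security Agent returns findings in the shape the prompt asks for (severity, file, reason, recommendation), a small gate can decide whether the pipeline may continue. This is only a sketch: the `Finding` class, the four-level severity scale, and the "medium blocks by default" threshold are my assumptions, not a standard.

```python
from dataclasses import dataclass

# Assumed severity scale; the report above only uses "Medium" and "Low".
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Finding:
    severity: str
    file: str
    reason: str
    recommendation: str


def blocking_findings(findings: list[Finding], threshold: str = "medium") -> list[Finding]:
    """Return findings at or above the threshold, most severe first."""
    minimum = SEVERITY_ORDER[threshold]
    blocking = [f for f in findings if SEVERITY_ORDER[f.severity.lower()] >= minimum]
    return sorted(blocking, key=lambda f: SEVERITY_ORDER[f.severity.lower()], reverse=True)


findings = [
    Finding("Low", "app/Services/BillingService.php",
            "The error log includes full request payload.",
            "Log only the invoice ID and gateway error code."),
    Finding("Medium", "app/Http/Controllers/InvoiceController.php",
            "The new endpoint does not call a policy or permission check.",
            "Add an authorization check and a feature test for unauthorized access."),
]

for finding in blocking_findings(findings):
    print(f"{finding.severity}: {finding.file}")
```

The gate only reads and sorts findings. It never touches the code, which keeps the Security Agent read-only.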

Documentation Agent

The Documentation Agent turns implementation details into human-readable guidance. It can update:

- README
- docs folder
- API examples
- changelog
- migration notes
- release notes

Example input:

Behavior changed:
Weekly reminders are skipped when a user submitted the current weekly form.

Files changed:
- app/Services/WeeklyReminderEligibility.php
- tests/Feature/WeeklyReminderEligibilityTest.php

Example documentation update:

### Weekly Reminder Eligibility

A user is not eligible for a weekly reminder if they already submitted the
weekly check-in form for the current week.

If no form was submitted, the existing reminder interval rules still apply.

This is one of the highest-value specialized agents, because documentation is the first thing that gets forgotten when engineers are busy.

Orchestrator Agent

The Orchestrator Agent coordinates the others, and the key rule is that it should not do all the work itself. Its job is:

- split the task
- assign agents
- pass context
- enforce order
- check required outputs
- stop at approval gates
- combine final report

Example workflow:

Orchestrator
  ↓
Analysis Agent: find relevant files
  ↓
Tester Agent: create failing test
  ↓
Developer Agent: implement change
  ↓
Tester Agent: run checks
  ↓
Reviewer Agent: review diff
  ↓
Security Agent: review risks
  ↓
Documentation Agent: update docs
  ↓
Orchestrator: final PR summary

The orchestrator creates structure; the specialized agents create focused output.
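Here is a minimal sketch of that coordination loop, with stub functions standing in for real agents. Everything in it (`run_pipeline`, `HandoffRejected`, the `approve` callback, the output keys) is a hypothetical shape for illustration. The point is only that the orchestrator enforces order, checks required outputs, and stops at gates without doing any of the work itself.

```python
from typing import Callable

# (agent name, agent function, keys its output must contain)
Step = tuple[str, Callable[[dict], dict], list[str]]


class HandoffRejected(Exception):
    """Raised when an agent's output lacks a required field."""


def run_pipeline(steps: list[Step], context: dict,
                 approve: Callable[[str, dict], bool]) -> list[str]:
    """Run agents in order, check required outputs, stop at a failed approval gate."""
    report = []
    for name, agent, required in steps:
        output = agent(context)
        missing = [key for key in required if key not in output]
        if missing:
            raise HandoffRejected(f"{name} is missing required output: {missing}")
        if not approve(name, output):
            report.append(f"{name}: stopped at approval gate")
            break
        context = {**context, name: output}  # pass this output on to later agents
        report.append(f"{name}: ok")
    return report


# Stub agents standing in for real model calls.
steps: list[Step] = [
    ("analysis",
     lambda ctx: {"relatedFiles": ["app/Services/WeeklyReminderEligibility.php"]},
     ["relatedFiles"]),
    ("developer",
     lambda ctx: {"filesChanged": ctx["analysis"]["relatedFiles"]},
     ["filesChanged"]),
    ("tester",
     lambda ctx: {"ran": ["php artisan test"], "notCovered": []},
     ["ran", "notCovered"]),
]

report = run_pipeline(steps, {"task": "Prevent duplicate weekly reminders"},
                      approve=lambda name, output: True)
print(report)
```

In a real setup the `approve` callback is where a human or a policy check would sit, for example before the Developer Agent's changes are accepted.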

How agents hand off work

Handoffs should be structured. Do not pass a vague paragraph when a typed artifact would work better.

Example handoff from Analysis Agent to Developer Agent:

{
  "task": "Prevent duplicate weekly reminders after form submission",
  "relatedFiles": [
    "app/Console/Commands/SendWeeklyReminders.php",
    "app/Services/WeeklyReminderEligibility.php",
    "tests/Feature/WeeklyReminderEligibilityTest.php"
  ],
  "currentBehavior": "Reminder eligibility checks interval but not submitted weekly forms.",
  "recommendedChange": "Add submitted-form guard before interval logic.",
  "risks": [
    "week boundary timezone behavior"
  ]
}

A typed artifact like this is easier for the next agent to use and easier for humans to inspect.
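One benefit of a structured handoff is that it can be checked before the next agent ever sees it. The sketch below validates the example payload using only the standard library. The required field list is taken from the JSON above, while `validate_handoff` itself is a hypothetical helper.

```python
import json

# Field names come from the example handoff; the expected types are assumptions.
REQUIRED_FIELDS = {
    "task": str,
    "relatedFiles": list,
    "currentBehavior": str,
    "recommendedChange": str,
    "risks": list,
}


def validate_handoff(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the handoff is well-formed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"{field} must be a {expected.__name__}")
    return problems


handoff = json.dumps({
    "task": "Prevent duplicate weekly reminders after form submission",
    "relatedFiles": ["app/Services/WeeklyReminderEligibility.php"],
    "currentBehavior": "Reminder eligibility checks interval but not submitted weekly forms.",
    "recommendedChange": "Add submitted-form guard before interval logic.",
    "risks": ["week boundary timezone behavior"],
})

print(validate_handoff(handoff))
print(validate_handoff('{"task": "Prevent duplicate weekly reminders"}'))
```

A vague paragraph cannot fail a check like this, which is a good reason not to use one.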

A practical agent team configuration

Here is a simple configuration example:

agents:
  analysis:
    role: "Find relevant code and explain current behavior"
    edit: false

  developer:
    role: "Implement scoped code changes"
    edit: true
    requires_approval_for:
      - "production code"
      - "dependencies"

  tester:
    role: "Create and run tests"
    edit_paths:
      - "tests/**"

  reviewer:
    role: "Review code quality and maintainability"
    edit: false

  security:
    role: "Review security and privacy risks"
    edit: false

  documentation:
    role: "Update docs and changelog"
    edit_paths:
      - "README.md"
      - "docs/**"
      - "CHANGELOG.md"

This setup is not complicated, and that is the point. You can start small.
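The configuration mixes two ways of expressing edit rights: a boolean `edit` and a list of `edit_paths`. A loader would need to normalize them, and a rough sketch of that step is below. The `AGENTS` dict mirrors the configuration as a Python literal to avoid a YAML dependency, and treating a missing key as read-only is my assumption, not something the config format defines.

```python
from fnmatch import fnmatchcase

# The configuration above, mirrored as a Python dict (roles omitted for brevity).
AGENTS = {
    "analysis": {"edit": False},
    "developer": {"edit": True, "requires_approval_for": ["production code", "dependencies"]},
    "tester": {"edit_paths": ["tests/**"]},
    "reviewer": {"edit": False},
    "security": {"edit": False},
    "documentation": {"edit_paths": ["README.md", "docs/**", "CHANGELOG.md"]},
}


def edit_patterns(agent: str) -> list[str]:
    """Normalize `edit` / `edit_paths` into one list of glob patterns."""
    config = AGENTS[agent]
    if "edit_paths" in config:
        return list(config["edit_paths"])
    # Assumption: an agent with neither key is read-only.
    return ["**"] if config.get("edit", False) else []


def may_edit(agent: str, path: str) -> bool:
    """True if the agent's normalized patterns cover the given path."""
    return any(fnmatchcase(path, pattern) for pattern in edit_patterns(agent))


print(may_edit("documentation", "docs/reminders.md"))
print(may_edit("documentation", "app/Services/WeeklyReminderEligibility.php"))
print(may_edit("reviewer", "README.md"))
print(may_edit("developer", "app/Services/WeeklyReminderEligibility.php"))
```

The `requires_approval_for` entries are free-text labels here, so mapping them to concrete paths or actions is left to whatever approval gate you put in front of the Developer Agent.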

Final thought

Specialized agents are not about making AI architecture fancy. They are about making AI work easier to control.

A Developer Agent implements. A Tester Agent verifies. A Reviewer Agent challenges. A Security Agent protects. A Documentation Agent explains. An Orchestrator Agent coordinates. That structure mirrors how real engineering teams already work, and that is exactly why it works.

One giant agent may look impressive in a demo. A small team of focused agents is the thing that holds up in production.



Top comments (5)

nexus-lab-zen • Jun 25

The role split is clean, and having the Tester report "what it could not verify" is the part most setups skip. The edge I keep hitting: the Orchestrator routes on each agent's self-report, so a handoff carries the output but not an independent check that the done-claim is real. A Tester saying "these paths are uncovered" is itself a claim — in many setups, nothing downstream distinguishes an honest "I couldn't cover this" from a convenient one, and the Reviewer often inherits that gap rather than closes it.

What seems to work better than trusting the report is making each handoff carry evidence the next agent (or a human) can re-check on its own — the diff plus the artifact that proves the claim, not the agent's narration of it. Curious how you're thinking about the Orchestrator verifying coverage claims vs. trusting them, especially when the Tester's "could not verify" set is self-declared.

Nazar Boyko • Jun 26

This is the sharpest critique of the whole pattern, and you're right: routing on self-reports just relocates the trust problem instead of solving it. A Tester's "couldn't cover X" is exactly as gameable as a Developer's "done."

The way I'm leaning now is to make the Orchestrator never accept a claim that isn't accompanied by a re-runnable artifact. Coverage isn't "the Tester said so"; it's the actual coverage report (or a diff of which lines/branches the new tests touch) attached to the handoff, so the Reviewer or a human can re-execute it independently. If the artifact is missing, the handoff is rejected at the gate, not trusted-then-reviewed.

That shrinks the Tester's job from "judge coverage" to "produce evidence of coverage," which is a much harder thing to fake convincingly. The honest-vs-convenient "could not verify" still exists, but now it's checkable: a self-declared gap that the coverage data contradicts becomes a Reviewer finding rather than something inherited silently.

Where it gets genuinely hard, and I don't think I've solved it, is non-executable claims like "this is safe across timezones" or "no behavioral regression for callers." There's no cheap artifact for those, so they still fall back to a human gate. Curious whether you've found a way to make those re-checkable, or whether you just treat them as always-escalate.

nexus-lab-zen • Jun 26

That last bucket is the one I've sunk the most time into, and I don't have a clean win either — but the move that helped was to stop trying to make the claim executable and instead force it to carry its own scope boundary as the artifact.

So "safe across timezones" never ships as a judgment. It ships as "exercised under TZ=UTC, America/Sao_Paulo, Asia/Kolkata — outputs attached." Now it's re-runnable, and more importantly it's honest about being bounded: anything outside that enumerated set is explicitly not covered rather than implied-safe. The universal shrinks to the finite set actually touched.

"No behavioral regression for callers" gets the same treatment, but the artifact is a scope declaration instead of outputs: "checked the callers grep finds for X; anything added after this commit, or reached via reflection/dynamic dispatch, is outside this check." That turns an unfalsifiable universal into a checkable, re-greppable set — and the residual that genuinely can't be enumerated becomes a small, named thing.

That's what made it non-binary for me: not always-escalate vs trust, but shrink the escalation surface down to the irreducibly-human residual and make the agent name that edge out loud. The human gate then only sees the part that's actually un-automatable, not the whole claim — same win as your coverage case: a self-declared boundary the data can later contradict, instead of a silent inheritance.

Where I'd push it back to you: could the Orchestrator treat "scope declaration present and well-formed" as itself a gateable artifact — reject the handoff if the claim doesn't name its own boundary — even when the underlying claim isn't executable?

Muro • Jun 26

Great article! I like the focus on specialized AI agents instead of expecting one assistant to do everything. Breaking responsibilities into clear roles makes the workflow more practical, easier to review, and much closer to how real engineering teams collaborate. Nicely explained with useful examples.

Nazar Boyko • Jun 26

Thanks Muro! Yeah, that mirror-to-real-teams angle was the thing I kept coming back to — we already split implementation, QA, review, and security for a reason, and pretending one agent can hold all those modes at once just reintroduces the problem we solved years ago. Appreciate you reading 🙌