DEV Community

kunpeng-ai-lab

Posted on • Originally published at kunpeng-ai.com

Green Tests Are Evidence, Not Approval

#ai

Many teams are starting to use more than one AI coding agent.

One agent writes code. Another agent reviews. A human owner makes the final call.

That sounds reasonable, but without a shared process it can become unreliable very quickly.

The Executor may test its own work. The Reviewer may only check that tests are green. The Owner may receive a confident summary without durable evidence.

That is the problem ACS tries to solve.

ACS, short for Agent Collaboration SOP, is a vendor-neutral, file-first workflow for multi-agent engineering collaboration.

The core principle is:

Green tests are evidence, not approval.

Passing tests matter. But they do not prove that scope was respected, that the UI was inspected, that docs match the actual files, that public output was redacted, or that the change is safe to release.

Why Green Tests Are Not Enough

Tests answer specific questions. Approval answers a broader question: should this change move forward?

Green tests do not automatically prove that:

  • the requested scope was respected;
  • the UI was opened and visually inspected;
  • screenshots exist where visual evidence is needed;
  • documentation and handoff notes match the actual files;
  • the implementation did not introduce architecture drift;
  • public output has been redacted;
  • the change is safe to release, merge upstream, or share publicly.

If a human teammate submitted a change with no clear handoff, no review evidence, no scope notes, and no release-risk assessment, most engineering teams would not treat "the tests passed" as enough.

AI-agent work should not get a weaker standard just because the summary sounds confident.
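One way to make this concrete is to treat the review as a verdict over several evidence dimensions, of which green tests are only one. The sketch below is illustrative, not part of ACS itself; the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ReviewEvidence:
    """Hypothetical evidence gathered during an independent review."""
    tests_green: bool
    scope_respected: bool
    ui_inspected: bool          # UI was opened and visually checked
    screenshots_present: bool   # where visual evidence is needed
    docs_match_files: bool      # handoff notes match the actual files
    output_redacted: bool       # public output was scrubbed

def ready_for_owner_decision(ev: ReviewEvidence) -> bool:
    """Green tests are one piece of evidence; a decision needs all of them."""
    return all([
        ev.tests_green,
        ev.scope_respected,
        ev.ui_inspected,
        ev.screenshots_present,
        ev.docs_match_files,
        ev.output_redacted,
    ])

# Green tests alone do not make the change ready for an owner decision.
partial = ReviewEvidence(True, False, False, False, False, False)
assert not ready_for_owner_decision(partial)
```

The point of the structure is that no single field, including `tests_green`, can short-circuit the verdict.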

Owner, Executor, Reviewer

ACS separates three roles:

  • Owner: the human decision-maker responsible for goals, scope, release decisions, upstream PR boundaries, and business constraints.
  • Executor Agent: responsible for implementation, self-testing, evidence collection, and handoff.
  • Reviewer Agent: responsible for independent review across scope, architecture, tests, screenshots, evidence, redaction, and release risk.

The key rule is simple:

The executor does not approve itself.

An Executor can and should run tests. It can and should summarize what it changed. It can and should collect evidence.

But approval requires an independent check and a human decision.
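That rule can be enforced mechanically. This is a minimal sketch assuming nothing about any particular agent framework; the function and identifier names are hypothetical.

```python
def record_approval(executor: str, reviewer: str, owner_approved: bool) -> str:
    """Reject self-approval: the reviewing agent must differ from the
    executor, and a human owner must still make the final call."""
    if reviewer == executor:
        raise ValueError("executor cannot approve its own work")
    if not owner_approved:
        return "pending owner decision"
    return "approved"

# An independent reviewer plus an owner decision yields approval.
assert record_approval("agent-a", "agent-b", owner_approved=True) == "approved"

# Review alone is not approval: the owner has not decided yet.
assert record_approval("agent-a", "agent-b", owner_approved=False) == "pending owner decision"
```

Raising on `reviewer == executor` rather than returning a status makes self-approval a hard error instead of a state a pipeline might quietly ignore.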

From Chat Logs to Durable Files

Chat is useful while work is happening. It is a weak long-term engineering record.

Chat threads can be compressed. They can lose context. They can be separated from the exact repository state they were discussing. They can be hard for a later agent to inspect.

ACS prefers file-first handoff.

Typical ACS artifacts include:

  • Executor handoff
  • Reviewer report
  • Evidence ledger
  • Owner consensus report
  • Redacted case study
  • Anti-pattern review

This makes the workflow easier to resume after context compression, model changes, machine changes, or handoff to another agent.
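A file-first handoff also makes the workflow checkable by machine: a later agent can verify that the artifacts exist before work resumes. The filenames below are hypothetical assumptions (ACS does not mandate these exact names); the sketch simply reports which required files are missing.

```python
import tempfile
from pathlib import Path

# Hypothetical artifact filenames; illustrative only.
REQUIRED_ARTIFACTS = [
    "executor-handoff.md",
    "reviewer-report.md",
    "evidence-ledger.md",
]

def missing_artifacts(workdir: Path) -> list[str]:
    """Return the required handoff files absent from the working directory."""
    return [name for name in REQUIRED_ARTIFACTS if not (workdir / name).exists()]

with tempfile.TemporaryDirectory() as d:
    work = Path(d)
    (work / "executor-handoff.md").write_text("summary of changes")
    print(missing_artifacts(work))  # → ['reviewer-report.md', 'evidence-ledger.md']
```

A check like this can run before review starts, so a missing evidence ledger blocks the handoff instead of surfacing after approval.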

Case Studies and Anti-Patterns

ACS keeps two long-term memory areas:

  • case-studies/ captures redacted examples of real collaboration.
  • anti-patterns/ captures recurring failure modes and prevention checklists.

Examples of useful anti-patterns include:

  • the Executor approves its own work;
  • the Reviewer only checks whether tests are green;
  • evidence exists only in chat;
  • UI review happens without screenshots;
  • handoff notes drift away from the actual files;
  • public materials are shared without redaction.

The goal is to turn repeated mistakes into reusable team memory.

Public Sharing Needs a Redaction Gate

Public examples are useful, but they must be safe.

AI agents can accidentally include sensitive details in handoffs, reports, issues, PR descriptions, blog drafts, and case studies.

Before publishing a case study, remove:

  • customer names;
  • private repository URLs;
  • local absolute paths;
  • tokens, cookies, API keys, and webhooks;
  • private chat logs;
  • unpublished business information.

The point is not to hide the engineering lesson. The point is to preserve the lesson without leaking what should remain private.
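A simple automated pass can catch some of these before the human redaction review. The patterns below are illustrative assumptions, not an exhaustive or ACS-specified list; a real gate would need a much broader set.

```python
import re

# Illustrative detection patterns only; assumptions, not an ACS standard.
SENSITIVE_PATTERNS = {
    "api_key": re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\s*[:=]\s*\S+"),
    "private_repo": re.compile(r"https://github\.com/[^/\s]+/[^/\s]+-private"),
    "local_path": re.compile(r"(?:/Users/|/home/|[A-Z]:\\)\S+"),
}

def redaction_findings(text: str) -> list[str]:
    """Return the names of sensitive patterns found in a draft case study."""
    return [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]

draft = "Repro on /Users/alice/work/acme; API_KEY=sk-123 was used."
print(redaction_findings(draft))  # → ['api_key', 'local_path']
```

An empty result is not clearance to publish: the scan narrows the search, and a human still owns the final redaction decision.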

Open Source

ACS is open source, and practical contributions are welcome:

  • redacted case studies;
  • anti-pattern examples;
  • reviewer report improvements;
  • evidence ledger refinements;
  • examples from different agent tools and team setups.

GitHub:

https://github.com/kunpeng-ai-lab/agent-collaboration-sop

Full article:

https://kunpeng-ai.com/en/blog/agent-collaboration-sop-acs-case-library/?utm_source=blog_referral&utm_medium=referral&utm_campaign=acs-case-library-202605&utm_content=ending_cta

Multi-agent engineering does not become reliable just because more agents are involved.

It becomes reliable when execution, review, evidence, and human approval are separated clearly enough to be inspected.
