We had a workflow where one agent (the orchestrator) would land an implementation, then hand the project off to a second agent (the QA one) to generate E2E tests. The QA agent did everything right by every prompt-engineering checklist. It read the codebase. It enumerated user flows. It produced clean Playwright scenarios with realistic selectors. The tests ran. They passed. They were green and beautiful and they were testing the wrong feature.
Specifically: the orchestrator had just changed how the invitation signup form handles a new error state. The QA agent had no idea. It saw "invitation signup" in the task title, read the codebase from scratch, found the most prominent invitation flow (the one we'd shipped two months earlier), and wrote a thorough test for that. The new error state — the actual delta — was untouched.
The fix is embarrassingly small in retrospect. The orchestrator already knew what it had changed; it just wasn't telling the QA agent. We started passing the git diff. Test scope tightened immediately.
I'm building an AI dev harness called Codens; the relevant context here is that the orchestrator agent (Purple) and the QA agent (Blue) are separate services with separate Celery workers, separate prompts, and separate Claude Agent SDK contexts. They communicate over an internal HTTP API. The implementation-diff handoff is one HTTP field, but it's the field that decides whether Blue's test generation is precise or a confident guess.
## Why agents need the diff handed to them
The naïve mental model — the one I had for longer than I should have — is that any agent with codebase access can figure out "what changed" by itself. git log -1, git diff HEAD~1, read the files. The information is right there. Why should the orchestrator pre-compute it?
The reasons fall into three buckets, and they're not obvious until you watch a generation go off the rails.
The QA agent has the wrong frame. Blue's prompt is built around "generate test scenarios for this feature." Without a diff, "feature" gets inferred from the task goal text — usually a sentence pulled from the Notion ticket. "Add error handling to invitation signup" expands in the agent's working memory to "the invitation signup feature," and the agent tests the whole feature. The diff would have collapsed that frame: it's not the feature, it's the four lines that changed.
Subtle changes get lost. Some implementation changes don't show up in the file structure or the route table. A condition tightened from > to >=. A default flipped from true to false. A new field added to a request schema that's only validated server-side. An agent walking the codebase from scratch will read the changed file but has no reason to focus on those lines specifically — they look like the rest of the file. The diff's job is to point.
Test scope drifts toward the user-facing. Agents enumerating scenarios from a feature description gravitate toward end-to-end happy paths because those read like "what a user does." But the change might be a backend-only refactor that doesn't change any user-visible behavior — its tests should be assertion-heavy at a different layer. The diff is the only signal that tells the QA agent "the change isn't what the user does, it's how this function rejects malformed input now."
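To make the second bucket concrete, here's the `>` to `>=` class of change as a toy example (the function and values are illustrative, not from our codebase). A feature-level "it works" test passes before and after the change; only a probe at the exact boundary value catches the flip, and the diff is what makes that value visible:

```python
# Hypothetical seat-limit check. Suppose the implementation change tightened
# the comparison: it used to be `seats_used <= seat_limit`.
def can_invite(seats_used: int, seat_limit: int) -> bool:
    return seats_used < seat_limit

# A happy-path scenario misses the change entirely:
assert can_invite(0, 5)

# The boundary probe is the test the diff points at -- the exact value
# where behavior just flipped:
assert not can_invite(5, 5)
```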
You can mitigate all of this by writing a sharper task description. We tried. Task descriptions kept being the things QA tickets are written as: a goal, a user, a desired outcome. They are by design about features, because they're about what the human asked for. The diff is the thing that says what the agent actually did with that ask.
## The handoff: Purple to Blue, over HTTP
The wire format is plain. Purple has the diff sitting in the working tree after its develop step (it just made the changes). It serializes the diff as a string and POSTs it to one of Blue's two internal test-generation endpoints. Blue's request schema declares the field as optional:
```python
class GenerateE2ETestsFromTaskRequest(BaseModel):
    # ... task_goal, target_url, acceptance_criteria, test_count, auth_config ...
    code_diff: str | None = Field(
        None,
        description=(
            "Git diff of the implementation changes (git diff <base>...HEAD). "
            "When provided, Claude uses the actual code changes to generate more "
            "precise test scenarios."
        ),
    )
```
Two paths consume that field. The fast path is scenario extraction — Blue calls Claude once with the task goal and the diff, and gets back a list of scenario strings. The slow path is the exploratory agent — a multi-phase Claude Agent SDK session that walks the codebase with an MCP server, then drives a Playwright browser, then synthesizes tests. Both paths needed to learn the diff.
The fast path's prompt change is the boring one, and that's the point. In claude_client.py the user prompt got a single conditional block:
````python
diff_section = ""
if code_diff:
    diff_section = (
        f"\n\n実装差分(この実装内容を参照してテストシナリオを生成してください):\n"
        f"```diff\n{code_diff}\n```\n"
    )

user_prompt = f"""タスクゴール:
{task_goal}{diff_section}

上記のタスクゴール{('と実装差分' if code_diff else '')}から最大{count}個のE2Eテストシナリオを抽出してください。
各シナリオは「- 」で始まる形式で1行ずつ出力してください。"""
````
(The Japanese is incidental — Blue's QA prompts are localized for our primary use case. The structure is what matters: append the diff to the user prompt as a fenced block, mention it in the closing instruction sentence, and leave the system prompt unchanged.)
The system prompt got one new bullet:

```plaintext
4. 実装コードが提供されている場合は、実際に実装されたエンドポイントやUIパスを優先してテストする
```

Roughly: "if implementation code is provided, prioritize testing the endpoints and UI paths that were actually implemented." That's the steering signal. Without it, Claude reads the diff but treats it as supplementary context — interesting but not authoritative. The bullet flips the priority: the diff is the source of truth about what to test; the task goal is the source of truth about why.
The exploratory path is more interesting because the diff lands in a different phase. The exploratory agent runs three phases — code analysis, browser discovery, test synthesis — and only the first phase needs the diff. The agent client gets the diff in its constructor and stitches it into the code-analysis prompt:
````python
diff_context = ""
if self._code_diff:
    diff_context = (
        f"\n\nIMPLEMENTATION DIFF (focus your analysis on these changed files):\n"
        f"```diff\n{self._code_diff}\n```\n"
    )

user_prompt = f"""Analyze this codebase for E2E testing.

Exploration goal: {session.exploration_goal}
Focus areas: {', '.join(session.focus_areas) if session.focus_areas else 'All features'}{diff_context}

Start by getting the file structure, then read key page/component files.
{('Pay special attention to the files and endpoints shown in the diff above.' if self._code_diff else '')}"""
````
The "pay special attention" sentence is doing real work. The exploratory agent has tools — it'll happily spend ten turns reading files that have nothing to do with the change. With the diff and the steering sentence, the analysis phase concentrates its file-reading budget on the changed files, which means the browser-discovery phase that follows knows which UI paths to actually exercise.
The endpoint that fronts the exploratory path is a Celery launcher — POST returns a 202 with a task ID, and the worker runs for up to thirty minutes. The diff threads through the Celery kwargs:
```python
task = generate_exploratory_e2e.apply_async(
    kwargs={
        "project_id": str(proj_id),
        "target_url": request.target_url,
        "exploration_goal": request.exploration_goal,
        # ... browser config, viewport, timeout, auth ...
        "code_diff": request.code_diff,
    },
    queue="default",
)
logger.info(
    "internal_exploratory_e2e_task_submitted",
    proj_id=str(proj_id),
    task_id=task.id,
    has_diff=request.code_diff is not None,
)
```
That has_diff log line is in there for a reason I'll come back to under pitfalls.
## What changed in the test output
I want to be careful about what I claim here. We don't have a controlled measurement of test quality before and after — running the same task through both versions is hard when the orchestrator's behavior depends on the codebase state, and "test quality" is itself a contested metric. So I'll stick to qualitative changes I can point to in actual generated tests.
Scope tightened. Scenario lists got shorter and more specific. Before: "Test invitation signup happy path, test invitation signup with invalid email, test invitation signup as existing user, test resend invitation email." After (same task): "Test invitation signup with the new server-rejected error showing the inline message, test invitation signup retry after the error clears." The first list was four scenarios that all happened to mention invitation signup. The second was two scenarios that exercised the actual change.
Selectors leaned on changed UI. In the exploratory path, the browser-discovery phase was the noisier part — the agent would click through ten elements before settling on what to test. With the diff, it tended to navigate straight to the changed component. Both versions ended up at correct selectors, but synthesis had cleaner exploration history to draw from.
Subtle changes started getting tests. The > to >= class of change. Before, these almost always produced a "test feature X works correctly" scenario that didn't probe the boundary specifically. After, the diff made the boundary visible and tests started exercising the exact value where the condition flipped.
What didn't change:
Quality of the test itself. The Playwright code that came out the back end was about as good as before — selectors, waits, assertions. The diff doesn't help Claude write better Playwright; it helps Claude pick what to write Playwright for.
Performance on large diffs. A refactor diff that touches forty files and ten thousand lines is mostly noise from the agent's perspective. It's hard to focus on "the change" when the change is everywhere. The agent doesn't gracefully fall back to feature-level testing in that case — it kind of muddles through. Honest answer: large diffs are still hard, and we don't have a great solution beyond "split the task smaller upstream."
## Pitfalls along the way
Things that bit us, in roughly the order they bit:
Diff format choice. The first version of this passed git diff --name-only — just the file list. The reasoning was "it's compact, the agent can read the files itself." The reasoning was wrong: the agent could read the files but still had no signal about which lines changed. We switched to full unified diff (git diff <base>...HEAD). Tests got better. Tokens went up; we'll come back to that. We also considered git diff --stat as a middle ground but the line counts are a weak signal and the agent didn't use them.
Context window economy. A unified diff that's two thousand lines isn't free. Blue's scenario-extraction Claude call has a 1024-token max output but the input includes the system prompt, task goal, and diff — and the diff is the variable. We watched a few large-refactor tasks hit the context limit and either truncate or fail. Current behavior is "send the whole diff and hope," which works for typical task-sized changes (tens to hundreds of changed lines) and fails ungracefully for refactors. The cleaner solution is something like "summarize the diff before sending if it exceeds a threshold," but summarizing a diff is itself a Claude call and we haven't done it. The crude mitigation is that Purple's task graph tries to keep tasks small, which makes diffs small as a side effect.
Purple commit timing vs Blue trigger timing. This is the subtle one. The diff Purple sends to Blue is computed at trigger time, not commit time. If Purple commits, then does another step, then triggers Blue, the diff Blue sees has to either be (a) recomputed at trigger time from the commit history, or (b) cached from the develop step. We started with (b) — store the diff right after develop, pass it through the workflow context to the trigger step. That worked until a step in between modified files (e.g., a follow-up cleanup step). The cached diff was now stale. Switching to (a) — recompute from git diff <base>...HEAD at trigger time, where <base> is the parent of the develop commit — fixed it. The lesson: the diff is a function of two refs in the workflow, not a snapshot. Treat it as a query, not a value.
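The "query, not a value" shape can be sketched in a few lines. The ref derivation here is illustrative — the real workflow resolves the base from its own context rather than assuming the develop commit has exactly one parent:

```python
import subprocess

def handoff_diff(develop_commit: str, repo_dir: str = ".") -> str:
    """Recompute the handoff diff at trigger time: base is the parent of
    the develop commit, HEAD is wherever the workflow is now. Because this
    runs at trigger time, intervening cleanup commits are included instead
    of silently missing from a cached snapshot."""
    base = f"{develop_commit}~1"
    return subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True, cwd=repo_dir,
    ).stdout
```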
The "did the diff actually arrive" question. When tests came back wrong, we needed to know whether the diff failed to send or whether the diff was sent and the agent ignored it. That's why the structured log line in the task launch logs has_diff:
```python
logger.info(
    "internal_exploratory_e2e_task_submitted",
    proj_id=str(proj_id),
    task_id=task.id,
    has_diff=request.code_diff is not None,
)
```
It's a one-bit signal but it cuts the diagnosis space in half. Without it, every "the test scope was wrong" investigation started with a five-minute trace through Purple's HTTP client to figure out what payload it actually sent. With it, you check the log and either move on (diff was there, prompt issue) or look upstream (diff wasn't there, send issue).
Backward compatibility. The endpoint had to keep working for callers that don't send the diff — there are integration-test paths and direct-API consumers that hit it without going through Purple. So the field is Optional[str] and the prompt-construction code conditions every diff-related sentence on if self._code_diff. The unit tests cover both shapes:
```python
def test_generate_accepts_code_diff(self):
    """generate endpoint accepts code_diff and passes it to ClaudeClient."""
    # ... mock setup ...
    with TestClient(app) as client:
        resp = client.post(
            f"/api/v1/internal/projects/{proj_id}/e2e-tests/generate",
            json={"task_goal": "Test login", "code_diff": diff},
            headers={"X-Internal-Api-Key": "test-key"},
        )
    assert resp.status_code == 201
    call_kwargs = (
        mock_claude.generate_test_scenarios_from_task_goal.call_args.kwargs
    )
    assert call_kwargs["code_diff"] == diff


def test_generate_without_code_diff_still_works(self):
    """generate endpoint works when code_diff is omitted (backward compat)."""
    # ... mock setup, no code_diff in body ...
    assert resp.status_code == 201
```
The "without_code_diff" test is the cheap insurance we needed because the prompt-construction code is full of conditional inserts and it would be very easy to break the no-diff path with an f-string mistake.
## The general principle
The thing I keep coming back to: in a multi-agent system, the substance you pass between agents is more architecturally consequential than the prompt to any single agent. I had spent months tuning Blue's test-generation prompt — adding constraints, refining output format, picking better examples. None of that prompt-tuning closed the gap that one HTTP field closed in a day.
The reason is structural. Blue's prompt is a function — given inputs, produce tests. The quality ceiling of that function is bounded by the inputs. If the inputs don't include "what changed," no amount of prompt cleverness will produce tests that are about the change, because the information isn't in the function's domain. The agent will find something to test, and it will test that something well, but the something won't be the change.
This generalizes. In any pipeline where one agent's output feeds another's input, the question "what is the unit of work I'm passing between agents" is the architecture question. The candidates I've seen:
- The task description. "Add error handling to invitation signup." Coarse. Loses information about implementation details.
- The diff. What we landed on for Purple to Blue. Captures what changed; doesn't capture intent.
- Diff plus task description. What Purple actually sends. Diff is what changed, task is why. Both are needed; diff alone makes good tests for the wrong reason, task alone makes plausible tests for the wrong feature.
- A structured task spec. Task ID, acceptance criteria, file list, diff, test commands, auth config. We pass this for some agent transitions where Blue needs more than prose context. Heavier; harder to construct; easier to consume.
- The conversation log. Some teams pass the previous agent's full chain-of-thought. Has the most information; has way too much noise; the receiving agent's context window pays for it.
Each option is a tradeoff between completeness and signal-to-noise. The diff was a sweet spot for the test-generation handoff specifically because (a) it's compact relative to a full conversation log, (b) it's structured enough that Claude reliably parses it, and (c) the information it carries — which lines of which files changed — was exactly the signal Blue was missing. For other handoffs the right unit is different. The PRD-to-task-graph handoff in green-codens passes a structured spec; the bug-to-fix handoff in red-codens passes a stack trace plus repro. There's no universal answer; the question is "what does the receiver need to do its job, and what's the minimum form that carries it."
The mental shift, for me, was from "let me make each agent better" to "let me make the channels between agents richer." The agents were already capable. What they needed was less ambiguous inputs. Once you frame the system as a graph of agents with edges that carry typed payloads, the design moves are about the edges — what gets passed, in what format, with what guarantees about freshness — not about the nodes.
## Closing
If you're shipping a multi-agent system, I'm curious what unit you've landed on for the inter-agent payload. Is it a task description, a structured spec, a diff, a full conversation log, something else? What was the failure mode that pushed you toward that specific choice — was it scope drift like ours, or context-window economics, or something on the receiving agent's reliability? And once you'd picked it, what was the next thing that broke that you wish you'd designed for from the start?
The thing I haven't resolved yet, and would love to hear about: how do you handle the case where the diff is too large to send? Our current "split tasks smaller upstream" answer pushes the problem onto Purple's planning agent, but that's a constraint on planning, not a solution for QA. A "summarize the diff" pre-step is the natural next move, but compressing a diff loses the line-level precision that made the diff useful in the first place. Whoever has solved this for refactor-sized changes — I'd genuinely like to compare notes.