WonderLab

Posted on Jun 29

Workflow Series (02): Design Patterns — Four-Layer Architecture, Three Context Modes, and Approval Gate Design

#ai #workflow #designpatterns #production

From Script to Engineering

An early-stage workflow is often a single file: one Markdown file that describes everything, with all configuration hardcoded. This works at small scale. As the workflow grows, three problems appear:

Changing a timeout requires finding and updating multiple locations
Subagent task prompts are scattered through the workflow definition, impossible to test independently
Security policy and business logic are mixed together, making compliance review painful

The four-layer architecture addresses all three.

Four-Layer Architecture

Policy Layer     policy.md         Execution rules, global constraints
                                    → Who can do what; authorization for high-risk ops

Workflow Layer   workflow.md        Phase / Step structure, routing logic
                                    → The skeleton; no specific task content

TaskSpec Layer   templates/         Subagent task prompt templates
                                    → Detailed instructions and output contracts per task

Tool/Skill Layer skills/            Atomic capabilities
                                    → Skill definitions reusable across workflows

Core principle: each layer changes only its own concerns — nothing crosses layers.

# ✅ Correct: change analysis timeout → edit workflow.md
phase_3_analyze:
  timeout: 30m  # Workflow Layer change

# ✅ Correct: change analysis output format → edit templates/analyze.md
## Output Contract
{"confidence": float, "root_cause": str, ...}  ← TaskSpec Layer change

# ❌ Wrong: write permission rules inside a task prompt (permissions belong in Policy Layer)
# ❌ Wrong: write specific analysis steps inside workflow.md (steps belong in TaskSpec Layer)

Layer separation makes changes safe: editing templates/ only affects the corresponding subagent's output. Editing policy.md cannot accidentally break routing logic.
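The ownership rule can be sketched as a lookup from the kind of change to the path that owns it. This is a minimal Python sketch, not part of any workflow engine: the `CHANGE_OWNERS` table, the change-kind names, and both helper functions are hypothetical, chosen to mirror the examples above.

```python
# Hypothetical mapping from a kind of change to the layer path that owns it.
CHANGE_OWNERS = {
    "permission_rule": "policy.md",      # Policy Layer
    "timeout": "workflow.md",            # Workflow Layer
    "routing": "workflow.md",            # Workflow Layer
    "output_contract": "templates/",     # TaskSpec Layer
    "task_instructions": "templates/",   # TaskSpec Layer
    "skill_definition": "skills/",       # Tool/Skill Layer
}


def owning_path(change_kind: str) -> str:
    """Return the file or directory that should be edited for this change."""
    try:
        return CHANGE_OWNERS[change_kind]
    except KeyError:
        raise ValueError(f"unknown change kind: {change_kind!r}") from None


def crosses_layers(change_kind: str, edited_path: str) -> bool:
    """True when the edited path lies outside the layer that owns the change."""
    return not edited_path.startswith(owning_path(change_kind))
```

A check like `crosses_layers("permission_rule", "templates/analyze.md")` flags the first wrong case listed above: a permission rule written into a task prompt.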

Context Passing Modes

Deciding what a subagent should know is where workflow design goes wrong most often.

Passing the main Agent's full history to every subagent is the most common mistake. The context balloons: subagents slow down, output quality drops, and token cost grows with every phase.

Choose a passing mode based on what the subagent actually needs. Three modes:

accumulate

Definition: pass all relevant outputs the workflow has produced so far.

When to use: the subagent synthesizes conclusions from multiple earlier phases.

# Phase 7: write closing notification, needs conclusions from the whole workflow
phase_7_notify:
  context_mode: accumulate
  context_inputs:
    - phases.phase3.root_cause_summary
    - phases.phase4.fix_summary
    - phases.phase5.commit_result
    - phases.phase6.review_status

The Phase 7 subagent needs root cause, fix summary, commit outcome, and review status. Missing any one of them produces an incomplete notification.

last_only

Definition: pass only the output of the immediately preceding phase or step.

When to use: the subagent's task depends entirely on its direct predecessor; history is irrelevant.

# Phase 2: extract log files — only needs the attachment path from Phase 1
phase_2_extract_logs:
  context_mode: last_only
  context_inputs:
    - phases.phase1.attachment_path   # one field is all it needs

Extracting logs doesn't need the full Jira ticket details, just the file location. Passing all of Phase 1's output wastes context. last_only limits the source to the direct predecessor, and context_inputs narrows that to the one field needed.

explicit

Definition: name every specific field the subagent needs, sourcing from any prior phase.

When to use: the subagent needs specific fields from multiple phases, but not the complete output of any single phase.

# Phase 3: root cause analysis — needs bug_info (Phase 1) + log_dir (Phase 2)
phase_3_analyze:
  context_mode: explicit
  context_inputs:
    - source: phases.phase1
      fields: [bug_info.summary, bug_info.stack_trace, bug_info.jira_key]
    - source: phases.phase2
      fields: [log_dir, extracted_files]

Phase 3 needs the bug description (from Phase 1) and the log directory (from Phase 2), but not Phase 1's attachment path or Phase 2's raw extraction log. explicit mode controls precisely what flows into the subagent.
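The three modes can be sketched as one selection function. This is a simplified Python illustration under stated assumptions: `build_context` and `get_path` are hypothetical helpers, phase outputs are assumed to be nested dicts stored in execution order, and accumulate is reduced to "everything so far" rather than a filtered list.

```python
from typing import Any, Dict, List, Optional


def get_path(data: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as 'bug_info.summary' into nested dicts."""
    for key in dotted.split("."):
        data = data[key]
    return data


def build_context(
    mode: str,
    outputs: Dict[str, Dict[str, Any]],
    explicit_fields: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, Any]:
    """Select what a subagent sees.

    outputs: phase name -> that phase's output, inserted in execution order.
    explicit_fields: phase name -> dotted field paths (explicit mode only).
    """
    if mode == "accumulate":
        # Everything produced so far.
        return dict(outputs)
    if mode == "last_only":
        # Only the most recently completed phase.
        if not outputs:
            return {}
        last = next(reversed(outputs))
        return {last: outputs[last]}
    if mode == "explicit":
        # Only the named fields, from any prior phase.
        if not explicit_fields:
            raise ValueError("explicit mode requires explicit_fields")
        return {
            phase: {path: get_path(outputs[phase], path) for path in paths}
            for phase, paths in explicit_fields.items()
        }
    raise ValueError(f"unknown context_mode: {mode!r}")
```

In explicit mode, a missing field raises `KeyError` at build time, which surfaces a broken data dependency before the subagent runs.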

Choosing a Mode

Subagent synthesizes conclusions from multiple phases  → accumulate
Subagent depends only on its direct predecessor        → last_only
Subagent needs specific fields from multiple sources   → explicit (recommended default)

explicit is the safest default. Even when you're unsure what a subagent needs, start by naming specific fields. It's easier to debug than over-passing, and it documents the data dependencies explicitly.

Approval Gate Design

Approval gates are the nodes where humans intervene. Incomplete gate definitions are a common source of production incidents.

Three Gate Types

interrupt (blocking)
  Workflow pauses completely until human responds
  For: high-risk operations (code merge, production deploy)

notification (non-blocking)
  Workflow continues; human is notified in parallel
  For: low-risk operations where awareness is enough

approval (async)
  Asynchronous wait for approval within a specified time window
  For: formal approval processes with SLA requirements

Required Fields

# Complete approval gate definition
- gate_id: gate_B
  type: interrupt
  trigger_condition: "fix_result.all_passed == false after 3 retries"
  message: |
    Fix attempts: 3 failures.
    Root cause: {{ phases.phase3.root_cause_summary }}
    Last error: {{ phases.phase4.last_error }}

    Choose next action:
  options:
    - label: Manual intervention
      value: manual_fix
    - label: Re-analyze root cause
      value: re_analyze
    - label: Mark as requires manual fix
      value: mark_manual
  timeout: 24h
  timeout_action: pause    # ← most commonly omitted field

timeout_action is the most frequently missing field. Options:

pause    → suspend workflow after timeout, wait for human to resume (most common)
continue → proceed with default option after timeout (low-risk notification gates)
abort    → terminate the entire workflow after timeout (strict time-window operations)

A gate without timeout_action has no defined behavior once the timeout expires, so the workflow hangs indefinitely: no alert fires, no record is written, and no recovery path exists.
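A gate definition can be checked before the workflow ever runs. The following is a minimal Python sketch, assuming gates are loaded as plain dicts; `validate_gate` is a hypothetical helper, and treating every key of the example definition as required is an assumption rather than a rule stated by any engine.

```python
from typing import Any, Dict, List

# Assumption: every key shown in the example gate definition is required.
REQUIRED_FIELDS = (
    "gate_id", "type", "trigger_condition", "message",
    "options", "timeout", "timeout_action",
)
GATE_TYPES = {"interrupt", "notification", "approval"}
TIMEOUT_ACTIONS = {"pause", "continue", "abort"}


def validate_gate(gate: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the gate is complete."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in gate]
    if "type" in gate and gate["type"] not in GATE_TYPES:
        errors.append(f"unknown type: {gate['type']}")
    if "timeout_action" in gate and gate["timeout_action"] not in TIMEOUT_ACTIONS:
        errors.append(f"unknown timeout_action: {gate['timeout_action']}")
    return errors
```

Running this over every gate at load time turns a silent hang in production into an error at definition time.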

Approval Gate Message Design

The gate message is read by humans. It directly determines how fast decisions get made.

✅ Effective gate message:
  "Test pass rate: 67% (8/12 passing)
   Failing tests: test_null_input, test_overflow
   Current fix: modified boundary check in parseInput()
   Recommended action: re-analyze root cause — failure pattern
   doesn't match the identified root cause"

❌ Ineffective gate message:
  "Fix failed. Choose an action."

A good message lets the reviewer decide in 30 seconds. A bad one sends them to the logs.
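An effective message is easiest to produce when it is assembled from structured results instead of written by hand. A minimal Python sketch, assuming the test results are already available as counts and names; `render_gate_message` and its parameters are hypothetical.

```python
from typing import List


def render_gate_message(
    passed: int,
    total: int,
    failing: List[str],
    current_fix: str,
    recommendation: str,
) -> str:
    """Build a gate message that carries the facts a reviewer needs."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = round(100 * passed / total)
    lines = [
        f"Test pass rate: {rate}% ({passed}/{total} passing)",
        f"Failing tests: {', '.join(failing)}",
        f"Current fix: {current_fix}",
        f"Recommended action: {recommendation}",
    ]
    return "\n".join(lines)
```

Because every line maps to a named input, a message cannot go out with the pass rate or the failing tests left blank.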

Serial Retry vs Parallel Candidates

When a workflow encounters failure, two response strategies exist. Choosing the wrong one degrades efficiency or quality.

Serial Retry

Attempt 1 → fail
           ↓ (with failure reason + feedback)
Attempt 2 → fail
           ↓ (with failure reason + feedback)
Attempt 3 → pass

When to use: the error reason is concrete, later attempts can learn from earlier failures, and there's meaningful variation in angle or approach.

Example: root cause analysis (Phase 3)
  Attempt 1: analyze from code perspective
  Attempt 2: feedback "code analysis confidence low — try log anomaly patterns"
  Attempt 3: feedback "try tracing the call chain chronologically"

Each retry applies the previous failure as a learning signal.
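The feedback loop can be sketched in a few lines of Python. This is an illustration under assumptions, not an engine API: `serial_retry` is a hypothetical helper, and the attempt callable is assumed to accept the previous failure detail (or `None` on the first try) and return a `(passed, detail)` pair.

```python
from typing import Callable, List, Optional, Tuple

Attempt = Callable[[Optional[str]], Tuple[bool, str]]


def serial_retry(attempt: Attempt, max_attempts: int = 3) -> Tuple[bool, List[str]]:
    """Run attempts one after another, feeding each failure into the next."""
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_attempts):
        passed, detail = attempt(feedback)
        history.append(detail)
        if passed:
            return True, history
        # The failure detail becomes the learning signal for the next attempt.
        feedback = detail
    return False, history
```

The returned history doubles as a debugging record: it shows which angle each attempt took and why it was rejected.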

Parallel Candidates

            → Candidate A → test → pass  ← select this
analysis →  → Candidate B → test → fail
            → Candidate C → test → fail

When to use: the solution space is diverse, predicting which approach will work is impossible, and exploring multiple options concurrently then selecting the best is more efficient than serial exploration.

Example: code fix (Phase 4)
  Candidate A: fix the boundary check logic
  Candidate B: fix the caller's input validation
  Candidate C: fix parseInput()'s default value handling

All three approaches could be correct. Run them concurrently,
select the one that passes tests with the highest coverage.
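Concurrent candidates with a selection step can be sketched with Python's standard `concurrent.futures` module. The `pick_best` helper is hypothetical, and each candidate is assumed to be a callable that applies its fix, runs the tests, and returns a `(passed, coverage)` pair.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Tuple

Candidate = Callable[[], Tuple[bool, float]]


def pick_best(candidates: Dict[str, Candidate]) -> Optional[str]:
    """Run all candidates concurrently; return the passing one with the highest coverage."""
    if not candidates:
        return None
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = {name: pool.submit(fn) for name, fn in candidates.items()}
        results = {name: future.result() for name, future in futures.items()}
    # Keep only candidates whose tests passed, then rank by coverage.
    passing = {name: coverage for name, (passed, coverage) in results.items() if passed}
    if not passing:
        return None
    return max(passing, key=passing.get)
```

A `None` result means no candidate passed, which is a natural trigger for an approval gate.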

Selection Principle

Later attempts can learn from earlier failures → serial retry
Multiple approaches are equally plausible      → parallel candidates
Time-sensitive, can't afford serial latency    → parallel candidates
Comparing solution quality is the goal         → parallel candidates

Document the strategy in the workflow definition — it makes debugging straightforward:

phase_3_analyze:
  retry_strategy: serial          # different angles, learning from failure
  max_retries: 3
  feedback_mode: include_prev_error

phase_4_fix:
  retry_strategy: parallel        # solution space is diverse, select best
  parallel_candidates: 3
  selection_criteria: passed_tests AND max_coverage

Design Checklist

Four-layer separation

[ ] Policy (permissions/security) and Workflow (routing/structure) are in separate files
[ ] Subagent task prompts live in independent templates/ files
[ ] Mutable parameters (timeouts, retry counts) are defined in one place, such as workflow.md or a shared config.yaml, not hardcoded across files

Context passing

[ ] Every subagent invocation declares a context_mode
[ ] explicit mode lists specific fields — no whole Phase outputs passed in
[ ] Main Agent's full history is not sent to every subagent

Approval gates

[ ] Every gate has both timeout and timeout_action
[ ] Message contains enough information for a 30-second decision
[ ] Option values are enumerated types, not free text

Retry strategy

[ ] Every node with retry logic is labeled serial or parallel
[ ] Serial retries have feedback_mode (failure reason flows back to the generator)
[ ] Parallel candidates have a defined selection_criteria
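The context-passing and retry items of this checklist lend themselves to an automated check. A minimal Python sketch, assuming each phase definition is loaded as a dict with the keys used in the examples above; `lint_phase` is a hypothetical helper.

```python
from typing import Any, Dict, List


def lint_phase(name: str, phase: Dict[str, Any]) -> List[str]:
    """Check one phase definition against the context and retry checklist items."""
    problems: List[str] = []
    if "context_mode" not in phase:
        problems.append(f"{name}: missing context_mode")
    strategy = phase.get("retry_strategy")
    # A phase counts as having retry logic if any retry-related key is present.
    has_retry = any(
        key in phase for key in ("retry_strategy", "max_retries", "parallel_candidates")
    )
    if has_retry and strategy not in ("serial", "parallel"):
        problems.append(f"{name}: retry_strategy must be serial or parallel")
    if strategy == "serial" and "feedback_mode" not in phase:
        problems.append(f"{name}: serial retry needs feedback_mode")
    if strategy == "parallel" and "selection_criteria" not in phase:
        problems.append(f"{name}: parallel candidates need selection_criteria")
    return problems
```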

Summary

The four-layer architecture's value is isolation: Policy changes don't affect routing, and templates/ changes are independently testable. This is the foundation of maintainability.
explicit is the safest default for context passing: it names every field, costs fewer tokens than accumulate, is more flexible than last_only, and is easier to debug when something goes wrong.
Serial retry versus parallel candidates depends on the task: root cause analysis benefits from serial retry (each attempt learns from the last); code fix benefits from parallel candidates (the solution space is diverse). Reversing them degrades both efficiency and quality.

