From Script to Engineering
An early-stage workflow is often a single file: one Markdown file that describes everything, with all configuration hardcoded. This works at small scale. As the workflow grows, three problems appear:
- Changing a timeout requires finding and updating multiple locations
- Subagent task prompts are scattered through the workflow definition, impossible to test independently
- Security policy and business logic are mixed together, making compliance review painful
The four-layer architecture addresses all three.
Four-Layer Architecture
Policy Layer      policy.md    Execution rules, global constraints
                  → Who can do what; authorization for high-risk ops
Workflow Layer    workflow.md  Phase / Step structure, routing logic
                  → The skeleton; no specific task content
TaskSpec Layer    templates/   Subagent task prompt templates
                  → Detailed instructions and output contracts per task
Tool/Skill Layer  skills/      Atomic capabilities
                  → Skill definitions reusable across workflows
Core principle: each layer changes only its own concerns — nothing crosses layers.
# ✅ Correct: change analysis timeout → edit workflow.md
phase_3_analyze:
  timeout: 30m    ← Workflow Layer change
# ✅ Correct: change analysis output format → edit templates/analyze.md
## Output Contract
{"confidence": float, "root_cause": str, ...} ← TaskSpec Layer change
# ❌ Wrong: write permission rules inside a task prompt (permissions belong in Policy Layer)
# ❌ Wrong: write specific analysis steps inside workflow.md (steps belong in TaskSpec Layer)
Layer separation makes changes safe: editing templates/ only affects the corresponding subagent's output. Editing policy.md cannot accidentally break routing logic.
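The "nothing crosses layers" rule can be checked mechanically. The following is a minimal sketch, assuming each file has already been tagged with the concerns it contains; the concern names, the `LAYER_OWNERS` mapping, and `find_layer_violations` are hypothetical and not part of any existing tool.

```python
from typing import Dict, List, Set

# Hypothetical mapping from a concern to the layer file (or directory) that owns it.
LAYER_OWNERS = {
    "permissions": "policy.md",
    "routing": "workflow.md",
    "timeout": "workflow.md",
    "task_instructions": "templates/",
    "output_contract": "templates/",
    "atomic_capability": "skills/",
}


def find_layer_violations(declared: Dict[str, Set[str]]) -> List[str]:
    """Return one message per concern found in a file that does not own it."""
    violations = []
    for path, concerns in sorted(declared.items()):
        for concern in sorted(concerns):
            owner = LAYER_OWNERS.get(concern)
            if owner is None:
                continue  # unknown concern: nothing to check
            # Directory owners match by prefix, file owners by exact path.
            owned = path.startswith(owner) if owner.endswith("/") else path == owner
            if not owned:
                violations.append(f"{path}: '{concern}' belongs in {owner}")
    return violations


if __name__ == "__main__":
    found = find_layer_violations({
        "templates/analyze.md": {"output_contract", "permissions"},
        "workflow.md": {"routing", "timeout", "task_instructions"},
    })
    for line in found:
        print(line)
```

A check like this could run in CI so that a permission rule slipped into a task prompt is flagged before review.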
Context Passing Modes
Deciding what a subagent should know is where workflow design goes wrong most often.
Passing the main Agent's full history to every subagent is the most common mistake. The context balloons with each phase: subagents slow down, output quality drops, and token cost climbs.
Choose a passing mode based on what the subagent actually needs. Three modes:
accumulate
Definition: pass all relevant outputs the workflow has produced so far.
When to use: the subagent synthesizes conclusions from multiple earlier phases.
# Phase 7: write closing notification, needs conclusions from the whole workflow
phase_7_notify:
  context_mode: accumulate
  context_inputs:
    - phases.phase3.root_cause_summary
    - phases.phase4.fix_summary
    - phases.phase5.commit_result
    - phases.phase6.review_status
The Phase 7 subagent needs root cause, fix summary, commit outcome, and review status. Missing any one of them produces an incomplete notification.
last_only
Definition: pass only the output of the immediately preceding phase or step.
When to use: the subagent's task depends entirely on its direct predecessor; history is irrelevant.
# Phase 2: extract log files — only needs the attachment path from Phase 1
phase_2_extract_logs:
  context_mode: last_only
  context_inputs:
    - phases.phase1.attachment_path  # one field is all it needs
Extracting logs doesn't need the full Jira ticket details — just where the file is. Passing all of Phase 1's output wastes context. last_only enforces taking only what's needed.
explicit
Definition: name every specific field the subagent needs, sourcing from any prior phase.
When to use: the subagent needs specific fields from multiple phases, but not the complete output of any single phase.
# Phase 3: root cause analysis — needs bug_info (Phase 1) + log_dir (Phase 2)
phase_3_analyze:
  context_mode: explicit
  context_inputs:
    - source: phases.phase1
      fields: [bug_info.summary, bug_info.stack_trace, bug_info.jira_key]
    - source: phases.phase2
      fields: [log_dir, extracted_files]
Phase 3 needs the bug description (from Phase 1) and the log directory (from Phase 2), but not Phase 1's attachment path or Phase 2's raw extraction log. explicit mode controls precisely what flows into the subagent.
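The three modes can be sketched as one resolver that turns a phase's context_inputs into the dictionary handed to the subagent. This is an illustrative sketch only: `get_path`, `build_context`, the `previous` argument, and the shape of `outputs` are assumptions made for the example, not the API of a real workflow engine.

```python
from typing import Any, Dict, List, Optional


def get_path(data: Dict[str, Any], dotted: str) -> Any:
    """Resolve a dotted path such as 'phases.phase1.bug_info.summary'."""
    node: Any = data
    for key in dotted.split("."):
        node = node[key]  # a KeyError here surfaces a missing dependency early
    return node


def build_context(mode: str, outputs: Dict[str, Any], inputs: List[Any],
                  previous: Optional[str] = None) -> Dict[str, Any]:
    """Collect only the declared fields for a subagent.

    accumulate / last_only: inputs is a list of dotted paths.
    explicit: inputs is a list of {"source": ..., "fields": [...]} entries.
    """
    context: Dict[str, Any] = {}
    if mode in ("accumulate", "last_only"):
        for path in inputs:
            # Assumption: last_only rejects reads from anything but the previous phase.
            if mode == "last_only" and not path.startswith(f"phases.{previous}."):
                raise ValueError(f"last_only may only read from {previous}: {path}")
            context[path] = get_path(outputs, path)
    elif mode == "explicit":
        for entry in inputs:
            for field in entry["fields"]:
                path = f"{entry['source']}.{field}"
                context[path] = get_path(outputs, path)
    else:
        raise ValueError(f"unknown context_mode: {mode}")
    return context


if __name__ == "__main__":
    outputs = {"phases": {
        "phase1": {"attachment_path": "/tmp/a.zip",
                   "bug_info": {"summary": "NPE in parser", "jira_key": "X-1"}},
        "phase2": {"log_dir": "/tmp/logs"},
    }}
    print(build_context("explicit", outputs, [
        {"source": "phases.phase1", "fields": ["bug_info.summary"]},
        {"source": "phases.phase2", "fields": ["log_dir"]},
    ]))
```

Because every mode resolves named paths, a missing upstream field fails loudly at context-build time instead of silently producing a thinner prompt.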
Choosing a Mode
Subagent synthesizes conclusions from multiple phases → accumulate
Subagent depends only on its direct predecessor → last_only
Subagent needs specific fields from multiple sources → explicit (recommended default)
explicit is the safest default. Even when you're unsure what a subagent needs, start by naming specific fields. It's easier to debug than over-passing, and it documents the data dependencies explicitly.
Approval Gate Design
Approval gates are the nodes where humans intervene. Incomplete gate definitions are a common source of production incidents.
Three Gate Types
interrupt (blocking)
  Workflow pauses completely until human responds
  For: high-risk operations (code merge, production deploy)
notification (non-blocking)
  Workflow continues; human is notified in parallel
  For: low-risk operations where awareness is enough
approval (async)
  Asynchronous wait for approval within a specified time window
  For: formal approval processes with SLA requirements
Five Required Fields
# Complete approval gate definition
- gate_id: gate_B
  type: interrupt
  trigger_condition: "fix_result.all_passed == false after 3 retries"
  message: |
    Fix attempts: 3 failures.
    Root cause: {{ phases.phase3.root_cause_summary }}
    Last error: {{ phases.phase4.last_error }}
    Choose next action:
  options:
    - label: Manual intervention
      value: manual_fix
    - label: Re-analyze root cause
      value: re_analyze
    - label: Mark as requires manual fix
      value: mark_manual
  timeout: 24h
  timeout_action: pause  # ← most commonly omitted field
timeout_action is the most frequently missing field. Options:
pause → suspend workflow after timeout, wait for human to resume (most common)
continue → proceed with default option after timeout (low-risk notification gates)
abort → terminate the entire workflow after timeout (strict time-window operations)
A gate without timeout_action leaves the workflow hanging indefinitely: no alert fires, no record is written, no recovery path exists.
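A missing timeout_action is easy to catch before deployment. The sketch below validates a gate that has already been parsed into a dictionary. It assumes the five required fields are trigger_condition, message, options, timeout, and timeout_action (the example above shows them but does not enumerate them); `validate_gate` and the constant names are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: these are the five required fields shown in the gate example.
REQUIRED_FIELDS = ("trigger_condition", "message", "options", "timeout", "timeout_action")
TIMEOUT_ACTIONS = {"pause", "continue", "abort"}
GATE_TYPES = {"interrupt", "notification", "approval"}


def validate_gate(gate: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the gate is complete."""
    gate_id = gate.get("gate_id", "<unnamed>")
    errors = [f"{gate_id}: missing {field}" for field in REQUIRED_FIELDS if field not in gate]

    gate_type = gate.get("type")
    if gate_type is not None and gate_type not in GATE_TYPES:
        errors.append(f"{gate_id}: unknown type '{gate_type}'")

    action = gate.get("timeout_action")
    if action is not None and action not in TIMEOUT_ACTIONS:
        errors.append(f"{gate_id}: unknown timeout_action '{action}'")
    return errors


if __name__ == "__main__":
    incomplete = {
        "gate_id": "gate_B",
        "type": "interrupt",
        "trigger_condition": "fix_result.all_passed == false after 3 retries",
        "message": "Fix attempts: 3 failures.",
        "options": [{"label": "Manual intervention", "value": "manual_fix"}],
        "timeout": "24h",
    }
    print(validate_gate(incomplete))
```

Running a check like this when the workflow is loaded turns a silent hang into an immediate, readable error.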
Approval Gate Message Design
The gate message is read by humans. It directly determines how fast decisions get made.
✅ Effective gate message:
"Test pass rate: 67% (8/12 passing)
Failing tests: test_null_input, test_overflow
Current fix: modified boundary check in parseInput()
Recommended action: re-analyze root cause — failure pattern
doesn't match the identified root cause"
❌ Ineffective gate message:
"Fix failed. Choose an action."
A good message lets the reviewer decide in 30 seconds. A bad one sends them to the logs.
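One way to keep gate messages consistently informative is to build them from structured results rather than writing them by hand. The helper below is a minimal sketch; `build_gate_message` and its parameters are hypothetical names chosen for this example.

```python
from typing import List


def build_gate_message(passed: int, total: int, failing_tests: List[str],
                       current_fix: str, recommendation: str) -> str:
    """Assemble a gate message that states results, current fix, and a recommendation."""
    rate = round(100 * passed / total) if total else 0
    lines = [
        f"Test pass rate: {rate}% ({passed}/{total} passing)",
        f"Failing tests: {', '.join(failing_tests) or 'none'}",
        f"Current fix: {current_fix}",
        f"Recommended action: {recommendation}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_gate_message(
        passed=8,
        total=12,
        failing_tests=["test_null_input", "test_overflow"],
        current_fix="modified boundary check in parseInput()",
        recommendation="re-analyze root cause",
    ))
```

Requiring every argument means a gate cannot be raised with nothing more than "Fix failed."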
Serial Retry vs Parallel Candidates
When a workflow encounters failure, two response strategies exist. Choosing the wrong one degrades efficiency or quality.
Serial Retry
Attempt 1 → fail
↓ (with failure reason + feedback)
Attempt 2 → fail
↓ (with failure reason + feedback)
Attempt 3 → pass
When to use: the error reason is concrete, later attempts can learn from earlier failures, and there's meaningful variation in angle or approach.
Example: root cause analysis (Phase 3)
Attempt 1: analyze from code perspective
Attempt 2: feedback "code analysis confidence low — try log anomaly patterns"
Attempt 3: feedback "try tracing the call chain chronologically"
Each retry applies the previous failure as a learning signal.
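The serial pattern can be sketched as a loop that feeds each failure back into the next attempt. This is an assumption-laden illustration: `serial_retry` is a hypothetical helper, `attempt` stands in for a subagent call, and `max_attempts` counts total attempts (the diagram above shows three).

```python
from typing import Callable, List, Optional, Tuple


def serial_retry(attempt: Callable[[Optional[str]], Tuple[bool, str]],
                 max_attempts: int = 3) -> Tuple[bool, List[str]]:
    """Call attempt(feedback) until it succeeds or attempts run out.

    attempt returns (ok, detail); on failure, detail becomes the feedback
    passed to the next attempt. Returns (ok, details of every attempt).
    """
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_attempts):
        ok, detail = attempt(feedback)
        history.append(detail)
        if ok:
            return True, history
        feedback = detail  # the failure reason flows into the next attempt
    return False, history


if __name__ == "__main__":
    angles = iter(["code perspective", "log anomaly patterns", "call chain timeline"])

    def analyze(feedback: Optional[str]) -> Tuple[bool, str]:
        angle = next(angles)
        # Stand-in for a real subagent call: only the third angle succeeds here.
        return angle == "call chain timeline", f"{angle} (feedback: {feedback})"

    print(serial_retry(analyze))
```

The history list doubles as an audit trail: it records what each attempt tried and why the previous one was rejected.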
Parallel Candidates
           → Candidate A → test → pass   ← select this
analysis   → Candidate B → test → fail
           → Candidate C → test → fail
When to use: the solution space is diverse, predicting which approach will work is impossible, and exploring multiple options concurrently then selecting the best is more efficient than serial exploration.
Example: code fix (Phase 4)
Candidate A: fix the boundary check logic
Candidate B: fix the caller's input validation
Candidate C: fix parseInput()'s default value handling
All three approaches could be correct. Run them concurrently,
select the one that passes tests with the highest coverage.
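A rough sketch of the parallel pattern, using Python's standard `concurrent.futures`: run every candidate concurrently, discard those that fail tests, and keep the one with the highest coverage. `CandidateResult` and `run_parallel_candidates` are hypothetical names, and each candidate is assumed to be a callable that applies its fix and runs the tests itself.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CandidateResult:
    name: str
    passed_tests: bool
    coverage: float  # fraction between 0.0 and 1.0


def run_parallel_candidates(
    candidates: List[Callable[[], CandidateResult]],
) -> Optional[CandidateResult]:
    """Run all candidates concurrently; return the passing one with the highest coverage."""
    if not candidates:
        return None
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = [pool.submit(candidate) for candidate in candidates]
        results = [future.result() for future in futures]
    passing = [r for r in results if r.passed_tests]
    # Implements "passed_tests AND max_coverage"; None if nothing passed.
    return max(passing, key=lambda r: r.coverage, default=None)


if __name__ == "__main__":
    winner = run_parallel_candidates([
        lambda: CandidateResult("A: boundary check", True, 0.82),
        lambda: CandidateResult("B: caller validation", False, 0.90),
        lambda: CandidateResult("C: default value handling", True, 0.75),
    ])
    print(winner)
```

Returning None when no candidate passes gives the workflow a clear signal to raise an approval gate rather than pick a failing fix.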
Selection Principle
Later attempts can learn from earlier failures → serial retry
Multiple approaches are equally plausible → parallel candidates
Time-sensitive, can't afford serial latency → parallel candidates
Comparing solution quality is the goal → parallel candidates
Document the strategy in the workflow definition — it makes debugging straightforward:
phase_3_analyze:
  retry_strategy: serial    # different angles, learning from failure
  max_retries: 3
  feedback_mode: include_prev_error
phase_4_fix:
  retry_strategy: parallel  # solution space is diverse, select best
  parallel_candidates: 3
  selection_criteria: passed_tests AND max_coverage
Design Checklist
Four-layer separation
- [ ] Policy (permissions/security) and Workflow (routing/structure) are in separate files
- [ ] Subagent task prompts live in independent templates/ files
- [ ] config.yaml centralizes mutable parameters (timeouts, retry counts)
Context passing
- [ ] Every subagent invocation declares a context_mode
- [ ] explicit mode lists specific fields; no whole Phase outputs are passed in
- [ ] Main Agent's full history is not sent to every subagent
Approval gates
- [ ] Every gate has both timeout and timeout_action
- [ ] Message contains enough information for a 30-second decision
- [ ] Option values are enumerated types, not free text
Retry strategy
- [ ] Every node with retry logic is labeled serial or parallel
- [ ] Serial retries have feedback_mode (failure reason flows back to the generator)
- [ ] Parallel candidates have a defined selection_criteria
Summary
- The four-layer architecture's value is isolation: Policy changes don't affect routing, and templates/ changes are independently testable. This is the foundation of maintainability.
- explicit is the safest default for context passing: it names every field, costs fewer tokens than accumulate, is more flexible than last_only, and is easier to debug when something goes wrong.
- Serial retry vs parallel candidates is a directional choice: root cause analysis benefits from serial retry (learning is effective); code fix benefits from parallel candidates (solution space is diverse). Reversing them degrades both efficiency and quality.