The Cost Blind Spot
A single Skill's cost is easy to compute: input_tokens × input_price + output_tokens × output_price.
A workflow is harder. With 7 phases, multiple subagents per phase, and Phase 4 running 3 concurrent candidates, how much did one run cost? Most teams can't answer, and without the answer they can't optimize.
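To make the gap concrete, here is a minimal sketch of both calculations. The names `call_cost` and `run_cost` are illustrative, the token counts are made up, and the per-1,000-token prices are example figures rather than authoritative pricing.

```python
from typing import Iterable, Tuple

def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost of one LLM call in USD; prices are USD per 1,000 tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1000

def run_cost(calls: Iterable[Tuple[int, int, float, float]]) -> float:
    """Cost of one workflow run: the sum over every subagent call in it."""
    return sum(call_cost(*call) for call in calls)

# A single call is one multiplication away from its cost.
single = call_cost(15_000, 500, input_price=0.015, output_price=0.075)

# A run is many such calls, possibly on different models.
# The arithmetic stays trivial; the hard part is collecting every call.
calls = [
    (15_000, 500, 0.015, 0.075),   # hypothetical call on the expensive model
    (4_000, 1_000, 0.003, 0.015),  # hypothetical call on the cheaper model
]
total = run_cost(calls)
print(round(single, 4), round(total, 4))
```

The formula never changes between the two cases. What changes is that the workflow needs a record of every call, which is what the tracking below provides.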
Cross-Phase Cost Tracking
Record Token Consumption in the State File
After each subagent call, write token consumption into workflow_state.json:
{
  "workflow_id": "wf-bug-e2e-AE-33995-20260601",
  "cost_tracking": {
    "phase_1_jira": {
      "model": "claude-sonnet-4-6",
      "input_tokens": 850,
      "output_tokens": 420,
      "cost_usd": 0.0019
    },
    "phase_3_analyze": {
      "model": "claude-opus-4-8",
      "input_tokens": 15000,
      "output_tokens": 500,
      "cost_usd": 0.2625
    },
    "phase_4_candidate_a": {
      "model": "claude-sonnet-4-6",
      "input_tokens": 8000,
      "output_tokens": 2000,
      "cost_usd": 0.046
    },
    "phase_4_candidate_b": {"model": "claude-sonnet-4-6", "cost_usd": 0.0423},
    "phase_4_candidate_c": {"model": "claude-sonnet-4-6", "cost_usd": 0.0498},
    "total_usd": 0.5145
  }
}
Collection: the LLM API response object includes a usage field. Write it to the state file after the subagent completes.
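    import anthropic  # assumption: the Anthropic Python SDK; any client with the same response shape works

    llm_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # USD per 1,000 tokens
    PRICES = {
        "claude-sonnet-4-6": {"input": 0.003, "output": 0.015},
        "claude-opus-4-8": {"input": 0.015, "output": 0.075},
    }

    def invoke_subagent(phase_id: str, prompt: str, model: str) -> dict:
        response = llm_client.messages.create(
            model=model,
            max_tokens=4096,  # required by the Messages API; tune per phase
            messages=[{"role": "user", "content": prompt}],
        )
        # Token counts come straight from the response's usage field.
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        price = PRICES[model]
        cost_usd = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000
        return {
            "phase_id": phase_id,  # lets the caller key the cost record in cost_tracking
            "output": response.content[0].text,
            "cost": {
                "model": model,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "cost_usd": round(cost_usd, 4),
            },
        }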
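The function above returns the cost record but does not persist it. A minimal sketch of the write step follows, assuming the state file layout shown earlier. `record_cost` is a hypothetical helper name, and recomputing `total_usd` as the sum of all recorded entries is an assumption about how the total is maintained.

```python
import json
import os
import tempfile
from pathlib import Path

def record_cost(state_path: Path, phase_id: str, cost: dict) -> float:
    """Merge one subagent's cost record into cost_tracking and refresh total_usd."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    tracking = state.setdefault("cost_tracking", {})
    tracking[phase_id] = cost
    # Assumption: total_usd is the sum of every recorded per-call cost.
    tracking["total_usd"] = round(
        sum(v.get("cost_usd", 0.0) for v in tracking.values() if isinstance(v, dict)),
        4,
    )
    # Write to a temp file, then rename, so a crash cannot leave a half-written state file.
    fd, tmp_name = tempfile.mkstemp(dir=str(state_path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp, indent=2)
    os.replace(tmp_name, state_path)
    return tracking["total_usd"]

# Demo against a throwaway state file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_state = Path(tmp_dir) / "workflow_state.json"
    record_cost(demo_state, "phase_1_jira",
                {"model": "claude-sonnet-4-6", "cost_usd": 0.0019})
    demo_total = record_cost(demo_state, "phase_3_analyze",
                             {"model": "claude-opus-4-8", "cost_usd": 0.2625})
    demo_keys = sorted(json.loads(demo_state.read_text())["cost_tracking"])
print(demo_total)
```

This sketch is not safe for concurrent writers. If the three Phase 4 candidates finish at the same time, the orchestrator should collect their results and write them one at a time, or guard the file with a lock.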
Cost Hotspot Analysis
After collecting data from several real runs, the distribution typically looks like this:
Phase              Avg cost   Share
──────────────────────────────────────
phase_3_analyze    $0.26      51%     ← most expensive: Opus + large log input
phase_4_fix (×3)   $0.14      27%     ← second: 3 concurrent candidates
phase_2_logs       $0.006     1%
phase_1_jira       $0.002     0.4%
phase_5_commit     $0.003     0.6%
phase_7_notify     $0.002     0.4%
──────────────────────────────────────
Total              $0.51      100%
Phase 3 takes 51% because of two factors: Opus model selection and large log input (15,000 tokens). Phase 4 takes 27% because of three concurrent candidates.
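The Share column is each phase's average cost divided by the average total per run. A small sketch of that calculation, using four rows from the table above; `cost_shares` is an illustrative name, not part of any existing tool.

```python
def cost_shares(avg_cost_by_phase: dict, total: float) -> list:
    """Return (phase, avg_cost, share_percent) tuples, most expensive first."""
    if total <= 0:
        return []
    rows = [
        (phase, cost, round(100 * cost / total, 1))
        for phase, cost in avg_cost_by_phase.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Average per-run costs taken from the table above.
avg = {
    "phase_1_jira": 0.002,
    "phase_2_logs": 0.006,
    "phase_3_analyze": 0.26,
    "phase_4_fix": 0.14,
}
shares = cost_shares(avg, total=0.51)
for phase, cost, share in shares:
    print(f"{phase:<18} ${cost:<7} {share}%")
```

Sorting by cost rather than by phase order puts the hotspots at the top, which is the only part of the report most readers need.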
Optimization Directions
Reduce Phase 3 cost:
    # Current: Opus analyzes full logs
    phase_3_config = {"model": "claude-opus-4-8", "context": "full_logs"}  # expensive

    # Option A: Sonnet filters key lines first, Opus analyzes only key lines
    phase_3_config = {
        "pre_filter": {"model": "claude-sonnet-4-6", "task": "extract_key_lines"},
        "analysis": {"model": "claude-opus-4-8", "context": "key_lines_only"},
    }

    # Option B: Sonnet first; upgrade to Opus only if confidence is low
    phase_3_config = {
        "model": "claude-sonnet-4-6",
        "fallback_model": "claude-opus-4-8",
        "fallback_threshold": 0.7,  # upgrade when confidence < 0.7
    }
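Option B's control flow can be sketched as below. How confidence is obtained is not specified here, so the sketch assumes the analysis call returns a self-reported confidence score alongside its result. `analyze_with_fallback`, the `run_analysis` callable, and the stand-in `fake_run_analysis` are all hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical signature: run_analysis(model, logs) -> (result_text, confidence in [0, 1]).
AnalysisFn = Callable[[str, str], Tuple[str, float]]

def analyze_with_fallback(
    run_analysis: AnalysisFn,
    logs: str,
    model: str = "claude-sonnet-4-6",
    fallback_model: str = "claude-opus-4-8",
    fallback_threshold: float = 0.7,
) -> dict:
    """Try the cheaper model first; escalate only when confidence is below the threshold."""
    result, confidence = run_analysis(model, logs)
    if confidence >= fallback_threshold:
        return {"model": model, "result": result,
                "confidence": confidence, "escalated": False}
    # Low confidence: pay for the stronger model. The cheap call is a sunk cost.
    result, confidence = run_analysis(fallback_model, logs)
    return {"model": fallback_model, "result": result,
            "confidence": confidence, "escalated": True}

def fake_run_analysis(model: str, logs: str) -> Tuple[str, float]:
    """Stand-in for a real subagent call, used only to exercise the control flow."""
    confidence = 0.9 if "opus" in model else 0.5
    return f"analysis by {model}", confidence

outcome = analyze_with_fallback(fake_run_analysis, logs="example log text")
print(outcome["model"], outcome["escalated"])
```

Option B only saves money when escalation is rare: every escalated run pays for both models. Tracking the `escalated` flag per run shows whether the threshold is set sensibly.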
Reduce Phase 4 cost:
Phase 4 runs 3 candidates to maximize the probability of at least one passing. If historical data shows candidate pass rate above 80%, run one first and skip the rest if it passes:
    phase_4_fix:
      strategy: lazy_parallel   # run 1 first; only run remaining 2 if it fails
      max_candidates: 3
      stop_on_first_pass: true
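The saving from this strategy follows from the pass rate. A rough estimate is sketched below, assuming every candidate costs about the same and that the remaining candidates all run together whenever the first one fails. The function names are illustrative.

```python
def expected_candidates(pass_rate: float, max_candidates: int = 3) -> float:
    """Expected number of candidate runs under lazy_parallel.

    One candidate always runs; the remaining ones run only if it fails.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return 1 + (1 - pass_rate) * (max_candidates - 1)

def expected_saving(pass_rate: float, max_candidates: int = 3) -> float:
    """Fraction of Phase 4 cost saved versus always running every candidate."""
    return 1 - expected_candidates(pass_rate, max_candidates) / max_candidates

runs_at_80 = expected_candidates(0.8)
saving_at_80 = expected_saving(0.8)
print(round(runs_at_80, 2), round(saving_at_80, 2))
```

At an 80% pass rate this works out to about 1.4 candidate runs per workflow instead of 3, roughly a 53% reduction in Phase 4 cost. The price is latency: when the first candidate fails, Phase 4 takes two rounds instead of one.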
Fault Diagnosis Methodology
When a workflow fails, the default is to dig through logs, guess execution order, and manually verify each phase. A classification tree and standard diagnostic steps reduce this to under 5 minutes.
Fault Classification Tree
Workflow didn't complete
├── Never started
│   └── Trigger condition problem
│       → Check AGENTS.md trigger keywords
│       → Check input parameter format (jira_key format)
│
├── Stuck at a Phase
│   ├── Subagent spawn failed
│   │   → Check sessions_spawn parameters
│   │   → Check network and auth configuration
│   │
│   ├── Subagent timed out (output file missing)
│   │   → Check task prompt length (too long → slow LLM response)
│   │   → Check model RPM/TPM limits
│   │
│   └── Subagent failed (output file exists but passed=false)
│       → Read the error field in the output file
│       → Check the template's output contract declaration
│
├── Waiting at approval gate (timeout)
│   → Check timeout_action is set to "pause"
│     If "continue", verify the default option is correct
│
└── Resumed from wrong position
    → Read workflow_state.json phase/step status
      in_progress phases re-execute (expected behavior)
      Check version binding (W3)
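The first level of the tree can be turned into a small classifier. This is only a sketch: the state shape and the status values (`pending`, `in_progress`, `waiting_approval`, `done`) are assumptions, and `classify_failure` is a hypothetical name. A spawn failure and a timeout look the same from the state file alone, so they share one label.

```python
from typing import Optional

def classify_failure(state: Optional[dict], output_exists: bool = False,
                     passed: Optional[bool] = None) -> str:
    """Map observable facts onto a branch of the classification tree.

    Assumed state shape: {"phases": {phase_id: {"status": ...}}}.
    output_exists and passed describe the stuck phase's output file.
    """
    phases = (state or {}).get("phases", {})
    statuses = [phase.get("status", "pending") for phase in phases.values()]
    if not statuses or all(status == "pending" for status in statuses):
        return "never_started"
    if "waiting_approval" in statuses:
        return "approval_gate_timeout"
    if "in_progress" in statuses:
        if not output_exists:
            return "stuck_no_output"        # spawn failed or subagent timed out
        if passed is False:
            return "stuck_subagent_failed"  # read the error field next
    # Nothing above matched: compare the state file with where execution resumed.
    return "check_resume_position"

example = classify_failure(
    {"phases": {"phase_1": {"status": "done"}, "phase_3": {"status": "in_progress"}}},
    output_exists=False,
)
print(example)
```

Each label corresponds to one branch of the tree, so the next check is already known once the label is printed.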
5-Step Standard Diagnosis
    # Step 1: Check current state
    cat "$WS/workflow_state.json" | python3 -m json.tool | grep -A3 '"phase"'

    # Step 2: Find first incomplete phase
    cat "$WS/workflow_state.json" | python3 -c "
    import json, sys
    state = json.load(sys.stdin)
    for phase_id, phase in state['phases'].items():
        if phase.get('status') != 'done':
            print(f'Stuck at: {phase_id} ({phase.get(\"status\", \"unknown\")})')
            break
    "

    # Step 3: Check that phase's output directory
    ls -la "$WS/phase_4/"

    # Step 4: If output file exists, read the error field
    cat "$WS/phase_4/candidate_a.json" | python3 -c "
    import json, sys
    r = json.load(sys.stdin)
    if not r.get('passed'):
        print('Error:', r.get('error', 'no error field'))
    "

    # Step 5: If Trace is configured, find the workflow in Langfuse
    # Search by workflow_id, check each Phase's Span for latency and errors
Common Fault Quick Reference
Scenario 1: Stuck at Phase 3, no output file after 5+ minutes
Symptom: phase_3 status=in_progress, analysis_final.json missing
Likely causes:
1. Task prompt too long (full log injected into prompt) → check Phase 3 input size
2. Model rate limiting → check API call logs
3. Spawn failed without error record → check sessions_spawn logs
Fix: manually set phase_3 status=pending, re-trigger resume
Scenario 2: All Phase 4 candidates have passed=false
Symptom: candidate_a/b/c.json all exist, all passed=false, Gate B triggered
Likely causes:
1. Root cause analysis was wrong → read analysis_final.json root_cause, verify manually
2. Test runner errors unrelated to the fix → read candidate_a.json error field
Fix: at the Gate B approval gate, select "re-analyze root cause"
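The manual reset in Scenario 1 is safer as a small script than as a hand edit of the JSON. A minimal sketch, assuming phases are stored under a `phases` key with a `status` field, as the diagnostic steps above do; `reset_phase` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

def reset_phase(state_path: Path, phase_id: str) -> str:
    """Set one phase back to "pending" and return its previous status."""
    state = json.loads(state_path.read_text())
    phase = state["phases"][phase_id]  # a KeyError here means a wrong phase_id
    previous = phase.get("status", "unknown")
    if previous == "done":
        raise ValueError(f"{phase_id} is already done; refusing to reset it")
    phase["status"] = "pending"
    state_path.write_text(json.dumps(state, indent=2))
    return previous

# Demo against a throwaway state file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "workflow_state.json"
    demo_path.write_text(json.dumps({"phases": {"phase_3": {"status": "in_progress"}}}))
    previous_status = reset_phase(demo_path, "phase_3")
    new_status = json.loads(demo_path.read_text())["phases"]["phase_3"]["status"]
print(previous_status, "->", new_status)
```

Refusing to reset a completed phase is a design choice in this sketch: it prevents an accidental re-run of work that already succeeded.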
Monthly Cost Report
    # tools/cost_report.py
    import json
    from pathlib import Path
    from collections import defaultdict

    def generate_monthly_report(state_dir: Path) -> dict:
        """Aggregate cost_tracking across every workflow_state.json under state_dir."""
        totals: dict = defaultdict(float)
        run_count = 0
        for state_file in state_dir.glob("**/workflow_state.json"):
            state = json.loads(state_file.read_text())
            cost_tracking = state.get("cost_tracking", {})
            # Per-phase entries are dicts; total_usd is a plain number and is handled below.
            for phase_id, phase_cost in cost_tracking.items():
                if phase_id != "total_usd" and isinstance(phase_cost, dict):
                    totals[phase_id] += phase_cost.get("cost_usd", 0)
            totals["total"] += cost_tracking.get("total_usd", 0)
            run_count += 1
        return {
            "run_count": run_count,
            "total_cost_usd": round(totals["total"], 4),
            "avg_cost_per_run": round(totals["total"] / run_count, 4) if run_count else 0,
            "by_phase": {k: round(v, 4) for k, v in totals.items() if k != "total"},
            # Phase with the largest summed cost across all runs; None if there is no data.
            "top_cost_driver": max(
                (k for k in totals if k != "total"),
                key=lambda k: totals[k],
                default=None,
            ),
        }
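The function above aggregates every state file it finds under `state_dir`. If runs from several months share one directory, one way to restrict the report is to filter on the run date. The sketch below assumes that `workflow_id` ends in a YYYYMMDD date, as in the example state file; `run_month` and `in_month` are hypothetical helpers.

```python
import re
from typing import Optional

# Assumption: workflow_id ends in a YYYYMMDD run date, e.g. "...-20260601".
_DATE_SUFFIX = re.compile(r"-(\d{4})(\d{2})(\d{2})$")

def run_month(workflow_id: str) -> Optional[str]:
    """Return "YYYY-MM" parsed from the workflow_id suffix, or None if absent."""
    match = _DATE_SUFFIX.search(workflow_id)
    if not match:
        return None
    year, month, _day = match.groups()
    if not 1 <= int(month) <= 12:
        return None
    return f"{year}-{month}"

def in_month(state: dict, month: str) -> bool:
    """True when the run recorded in this state dict belongs to the given "YYYY-MM"."""
    return run_month(state.get("workflow_id", "")) == month

example_month = run_month("wf-bug-e2e-AE-33995-20260601")
print(example_month)
```

Inside the report loop, a check such as `if not in_month(state, "2026-06"): continue` would skip runs from other months before any cost is added.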
Design Checklist
Cost tracking
- [ ] Token consumption written to state file after every subagent call
- [ ] State file includes cost_tracking.total_usd
- [ ] Tool available to aggregate cross-run costs and identify hotspot phases
Cost optimization
- [ ] Highest-cost Phase evaluated for model downgrade (Sonnet replacing Opus)
- [ ] Concurrent candidate count backed by historical pass rate data
- [ ] High-input-volume Phases evaluated for pre-filtering before LLM call
Fault diagnosis
- [ ] Fault classification tree covers all four failure modes
- [ ] 5-step shell diagnostic locates the failing Phase in under 5 minutes
- [ ] Common fault scenarios have documented fix operations
Summary
- Cost concentrates in 1-2 phases: Phase 3 (Opus + large input) and Phase 4 (3 parallel candidates) account for 78% of total cost in the breakdown above, so optimizing these two has more impact than optimizing all other phases combined
- Classify before diagnosing: identifying which category a failure belongs to (never started / phase stuck / gate timeout / resume error) points directly to the right check — faster than reading logs from the beginning
- Prepare diagnostic tools before problems occur: cost_report.py and diagnose.sh should exist before anything breaks; when something does break, open and run