WonderLab

Posted on Jul 5

Workflow Series (08): Operations and Cost — Cross-Phase Cost Tracking and Fault Diagnosis

#ai #workflow #token #operations

The Cost Blind Spot

A single Skill's cost is easy: input_tokens × input_price + output_tokens × output_price.

A workflow with 7 phases, multiple subagents per phase, Phase 4 running 3 concurrent candidates — how much did one run cost? Most teams can't answer, and without the answer, they can't optimize.

Cross-Phase Cost Tracking

Record Token Consumption in the State File

After each subagent call, write token consumption into workflow_state.json:

{
  "workflow_id": "wf-bug-e2e-AE-33995-20260601",
  "cost_tracking": {
    "phase_1_jira": {
      "model": "claude-sonnet-4-6",
      "input_tokens": 850,
      "output_tokens": 420,
      "cost_usd": 0.0019
    },
    "phase_3_analyze": {
      "model": "claude-opus-4-8",
      "input_tokens": 15000,
      "output_tokens": 500,
      "cost_usd": 0.2625
    },
    "phase_4_candidate_a": {
      "model": "claude-sonnet-4-6",
      "input_tokens": 8000,
      "output_tokens": 2000,
      "cost_usd": 0.046
    },
    "phase_4_candidate_b": {"model": "claude-sonnet-4-6", "cost_usd": 0.0423},
    "phase_4_candidate_c": {"model": "claude-sonnet-4-6", "cost_usd": 0.0498},
    "total_usd": 0.5145
  }
}

Collection: the LLM API response object includes a usage field. Write it to the state file after the subagent completes.

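import anthropic  # assumption: llm_client is an Anthropic SDK client

llm_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# USD per 1,000 tokens
PRICES = {
    "claude-sonnet-4-6": {"input": 0.003, "output": 0.015},
    "claude-opus-4-8":   {"input": 0.015, "output": 0.075},
}

def invoke_subagent(phase_id: str, prompt: str, model: str) -> dict:
    # phase_id is the key the caller uses when writing the cost entry to workflow_state.json
    response = llm_client.messages.create(
        model=model,
        max_tokens=4096,  # required by the Messages API; the value here is an assumption
        messages=[{"role": "user", "content": prompt}],
    )

    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    price = PRICES[model]
    # Prices are per 1,000 tokens, hence the division by 1000
    cost_usd = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000

    return {
        "output": response.content[0].text,
        "cost": {
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round(cost_usd, 4),
        }
    }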
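
The function above returns the cost entry but does not persist it. A minimal sketch of the write step could look like the following. The helper name record_phase_cost is hypothetical, and the sketch assumes the state file has the cost_tracking layout shown earlier.

```python
import json
from pathlib import Path


def record_phase_cost(state_path: Path, phase_id: str, cost: dict) -> float:
    """Store one phase's cost entry and refresh cost_tracking.total_usd."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    tracking = state.setdefault("cost_tracking", {})
    tracking[phase_id] = cost
    # Recompute the total from the per-phase entries so it cannot drift.
    tracking["total_usd"] = round(
        sum(v.get("cost_usd", 0.0) for v in tracking.values() if isinstance(v, dict)),
        4,
    )
    # Write to a temp file, then rename, so a crash cannot leave a half-written state file.
    tmp_path = state_path.with_name(state_path.name + ".tmp")
    tmp_path.write_text(json.dumps(state, indent=2))
    tmp_path.replace(state_path)
    return tracking["total_usd"]
```

This read-modify-write is not safe under concurrent writers. If the Phase 4 candidates run in parallel, one option is to have the orchestrator call it once per candidate after each one returns, rather than from inside the subagents.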

Cost Hotspot Analysis

After collecting data from several real runs, the distribution typically looks like this:

Phase                Avg cost   Share
──────────────────────────────────────
phase_3_analyze       $0.26     51%   ← most expensive: Opus + large log input
phase_4_fix (×3)      $0.14     27%   ← second: 3 concurrent candidates
phase_2_logs          $0.006     1%
phase_1_jira          $0.002     0.4%
phase_5_commit        $0.003     0.6%
phase_7_notify        $0.002     0.4%
──────────────────────────────────────
Total                 $0.51    100%

Phase 3 takes 51% because of two factors: Opus model selection and large log input (15,000 tokens). Phase 4 takes 27% because of three concurrent candidates.
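
The share column can be derived from the recorded costs. The sketch below is one way to do it. The function name cost_shares is hypothetical, and it takes the run's recorded total as an explicit argument so that shares are measured against cost_tracking.total_usd rather than against whichever phases happen to be passed in.

```python
def cost_shares(phase_costs: dict[str, float], total_usd: float) -> list[tuple[str, float, float]]:
    """Return (phase, cost, share in %) rows sorted from most to least expensive."""
    if total_usd <= 0:
        return []
    rows = [
        (phase, cost, round(100 * cost / total_usd, 1))
        for phase, cost in phase_costs.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Example with three of the averages from the table above.
rows = cost_shares(
    {"phase_1_jira": 0.002, "phase_3_analyze": 0.26, "phase_4_fix": 0.14},
    total_usd=0.51,
)
for phase, cost, share in rows:
    print(f"{phase:<18} ${cost:<7.3f} {share:>5.1f}%")
```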

Optimization Directions

Reduce Phase 3 cost:

# Current: Opus analyzes full logs
phase_3_config = {"model": "claude-opus-4-8", "context": "full_logs"}  # expensive

# Option A: Sonnet filters key lines first, Opus analyzes only key lines
phase_3_config = {
    "pre_filter": {"model": "claude-sonnet-4-6", "task": "extract_key_lines"},
    "analysis":  {"model": "claude-opus-4-8", "context": "key_lines_only"},
}

# Option B: Sonnet first; upgrade to Opus only if confidence is low
phase_3_config = {
    "model": "claude-sonnet-4-6",
    "fallback_model": "claude-opus-4-8",
    "fallback_threshold": 0.7,    # upgrade when confidence < 0.7
}
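
Option B can be sketched as a small wrapper around the analysis call. Everything here beyond the three config keys is an assumption: the analyze callable, the confidence field in its result, and the model_used and escalated fields are hypothetical names for illustration.

```python
from typing import Callable

# Hypothetical shape: analyze(model, logs) returns a dict with a "confidence" float.
AnalyzeFn = Callable[[str, str], dict]


def analyze_with_fallback(logs: str, analyze: AnalyzeFn, config: dict) -> dict:
    """Try the cheaper model first; re-run on the fallback model only if confidence is low."""
    result = analyze(config["model"], logs)
    result["model_used"] = config["model"]
    result["escalated"] = False
    if result.get("confidence", 0.0) < config["fallback_threshold"]:
        result = analyze(config["fallback_model"], logs)
        result["model_used"] = config["fallback_model"]
        result["escalated"] = True
    return result
```

An escalated run pays for both calls, so this option only saves money when escalations are rare. Tracking the escalated flag per run gives the data to check that.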

Reduce Phase 4 cost:

Phase 4 runs 3 candidates to maximize the probability that at least one passes. If historical data shows that a single candidate passes more than 80% of the time, run one first and skip the rest if it passes:

phase_4_fix:
  strategy: lazy_parallel       # run 1 first; only run remaining 2 if it fails
  max_candidates: 3
  stop_on_first_pass: true
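
The saving from this strategy follows from simple arithmetic. If a single candidate passes with probability p, the first candidate always runs and the remaining ones run only when it fails, so the expected number of candidates is 1 + (1 - p) × (max_candidates - 1). A small sketch, assuming every candidate costs about the same (the function name is hypothetical):

```python
def expected_candidates(pass_rate: float, max_candidates: int = 3) -> float:
    """Expected candidates run when 1 runs first and the rest run only if it fails."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return 1 + (1 - pass_rate) * (max_candidates - 1)


# At an 80% pass rate: about 1.4 candidates on average, versus 3 when all run up front.
print(round(expected_candidates(0.8), 2))
```

The trade-off is latency: when the first candidate fails, the run waits for a second round instead of having all three results at once.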

Fault Diagnosis Methodology

When a workflow fails, the default is to dig through logs, guess execution order, and manually verify each phase. A classification tree and standard diagnostic steps reduce this to under 5 minutes.

Fault Classification Tree

Workflow didn't complete
├── Never started
│   └── Trigger condition problem
│       → Check AGENTS.md trigger keywords
│       → Check input parameter format (jira_key format)
│
├── Stuck at a Phase
│   ├── Subagent spawn failed
│   │   → Check sessions_spawn parameters
│   │   → Check network and auth configuration
│   │
│   ├── Subagent timed out (output file missing)
│   │   → Check task prompt length (too long → slow LLM response)
│   │   → Check model RPM/TPM limits
│   │
│   └── Subagent failed (output file exists but passed=false)
│       → Read the error field in the output file
│       → Check the template's output contract declaration
│
├── Waiting at approval gate (timeout)
│   → Check timeout_action is set to "pause"
│     If "continue", verify the default option is correct
│
└── Resumed from wrong position
    → Read workflow_state.json phase/step status
      in_progress phases re-execute (expected behavior)
      Check version binding (W3)

5-Step Standard Diagnosis

# Step 1: Check current state
python3 -m json.tool "$WS/workflow_state.json" | grep -A3 '"phase"'

# Step 2: Find first incomplete phase
cat "$WS/workflow_state.json" | python3 -c "
import json, sys
state = json.load(sys.stdin)
for phase_id, phase in state['phases'].items():
    if phase.get('status') != 'done':
        print(f'Stuck at: {phase_id} ({phase.get(\"status\", \"unknown\")})')
        break
"

# Step 3: Check that phase's output directory
ls -la "$WS/phase_4/"

# Step 4: If output file exists, read the error field
cat "$WS/phase_4/candidate_a.json" | python3 -c "
import json, sys
r = json.load(sys.stdin)
if not r.get('passed'):
    print('Error:', r.get('error', 'no error field'))
"

# Step 5: If Trace is configured, find the workflow in Langfuse
# Search by workflow_id, check each Phase's Span for latency and errors
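
Steps 2 to 4 can also be combined into one script. The sketch below is an assumption-heavy illustration, not the series' actual tooling: the function name diagnose is hypothetical, and it assumes each phase writes its JSON outputs to a directory named after its phase id inside the workspace.

```python
import json
from pathlib import Path


def diagnose(workspace: Path) -> str:
    """Report the first phase that is not done and why it looks stuck."""
    state_file = workspace / "workflow_state.json"
    if not state_file.exists():
        return "never started: no workflow_state.json"
    state = json.loads(state_file.read_text())

    for phase_id, phase in state.get("phases", {}).items():
        status = phase.get("status", "unknown")
        if status == "done":
            continue
        # Assumption: outputs live in <workspace>/<phase_id>/*.json
        phase_dir = workspace / phase_id
        outputs = sorted(phase_dir.glob("*.json")) if phase_dir.is_dir() else []
        if not outputs:
            return f"{phase_id} ({status}): no output file, check spawn and timeout"
        errors = []
        for path in outputs:
            result = json.loads(path.read_text())
            if not result.get("passed"):
                errors.append(f"{path.name}: {result.get('error', 'no error field')}")
        if errors:
            return f"{phase_id} ({status}): " + "; ".join(errors)
        return f"{phase_id} ({status}): outputs passed but phase not marked done"
    return "all phases done"
```

The three return paths for a stuck phase map onto the "Stuck at a Phase" branches of the classification tree: a missing output points at spawn or timeout, and an output with passed=false points at the error field.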

Common Fault Quick Reference

Scenario 1: Stuck at Phase 3, no output file after 5+ minutes

Symptom: phase_3 status=in_progress, analysis_final.json missing

Likely causes:
  1. Task prompt too long (full log injected into prompt) → check Phase 3 input size
  2. Model rate limiting → check API call logs
  3. Spawn failed without error record → check sessions_spawn logs

Fix: manually set phase_3 status=pending, re-trigger resume
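
The manual reset in the fix above can be scripted so that it is repeatable and reversible. This is a sketch under the same assumption as the diagnosis steps, namely that phase status lives at state['phases'][phase_id]['status']. The helper name reset_phase is hypothetical.

```python
import json
from pathlib import Path


def reset_phase(state_path: Path, phase_id: str, status: str = "pending") -> str:
    """Set one phase's status and return the status it had before."""
    original_text = state_path.read_text()
    state = json.loads(original_text)
    phase = state["phases"][phase_id]  # a KeyError here means the phase id is wrong
    previous = phase.get("status", "unknown")

    # Keep a backup of the untouched file so the manual edit can be undone.
    backup_path = state_path.with_name(state_path.name + ".bak")
    backup_path.write_text(original_text)

    phase["status"] = status
    state_path.write_text(json.dumps(state, indent=2))
    return previous
```

After the reset, re-trigger the resume as usual. The backup file is only a convenience and can be deleted once the run completes.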

Scenario 2: All Phase 4 candidates have passed=false

Symptom: candidate_a/b/c.json all exist, all passed=false, Gate B triggered

Likely causes:
  1. Root cause analysis was wrong → read analysis_final.json root_cause, verify manually
  2. Test runner errors unrelated to the fix → read candidate_a.json error field

Fix: through Gate B approval gate, select "re-analyze root cause"

Monthly Cost Report

# tools/cost_report.py
import json
from collections import defaultdict
from pathlib import Path


def generate_monthly_report(state_dir: Path) -> dict:
    """Aggregate cost_tracking across every workflow_state.json under state_dir.

    No date filtering happens here: point state_dir at the directory that
    holds the month's runs.
    """
    totals: defaultdict[str, float] = defaultdict(float)
    run_count = 0

    for state_file in state_dir.glob("**/workflow_state.json"):
        state = json.loads(state_file.read_text())
        cost_tracking = state.get("cost_tracking")
        if not cost_tracking:
            continue  # runs with no recorded cost would dilute the average

        for phase_id, phase_cost in cost_tracking.items():
            # Only per-phase dicts are summed here; the scalar total_usd is handled below
            if isinstance(phase_cost, dict):
                totals[phase_id] += phase_cost.get("cost_usd", 0)

        totals["total"] += cost_tracking.get("total_usd", 0)
        run_count += 1

    by_phase = {k: round(v, 4) for k, v in totals.items() if k != "total"}
    return {
        "run_count": run_count,
        "total_cost_usd": round(totals["total"], 4),
        "avg_cost_per_run": round(totals["total"] / run_count, 4) if run_count else 0,
        "by_phase": by_phase,
        # Phase with the highest summed cost across all runs, or None if there were no runs
        "top_cost_driver": max(by_phase, key=by_phase.get, default=None),
    }

Design Checklist

Cost tracking

[ ] Token consumption written to state file after every subagent call
[ ] State file includes cost_tracking.total_usd
[ ] Tool available to aggregate cross-run costs and identify hotspot phases

Cost optimization

[ ] Highest-cost Phase evaluated for model downgrade (Sonnet replacing Opus)
[ ] Concurrent candidate count backed by historical pass rate data
[ ] High-input-volume Phases evaluated for pre-filtering before LLM call

Fault diagnosis

[ ] Fault classification tree covers all four failure modes
[ ] 5-step shell diagnostic locates the failing Phase in under 5 minutes
[ ] Common fault scenarios have documented fix operations

Summary

Cost concentrates in 1-2 phases: Phase 3 (Opus + large input) and Phase 4 (3 parallel candidates) account for roughly 78% of total cost in the breakdown above (51% + 27%), so optimizing these two has more impact than optimizing all other phases combined
Classify before diagnosing: identifying which category a failure belongs to (never started / phase stuck / gate timeout / resume error) points directly to the right check — faster than reading logs from the beginning
Prepare diagnostic tools before problems occur: cost_report.py and diagnose.sh should exist before anything breaks, so that when something does break you only have to run them
