Why Workflows Need a Dedicated Evaluation Framework
Traditional software testing covers code correctness. Workflows add two layers of uncertainty:
- LLM output is non-deterministic: the same input can produce different results across runs
- Cross-step dependencies: a Phase 3 problem may only surface at Phase 7, making the debugging chain long
Without an evaluation framework, every workflow change requires a full end-to-end run, which is slow, expensive, and still leaves gaps in coverage. Three-layer testing decomposes the problem.
Three-Layer Evaluation Structure
Layer 3: End-to-end tests (Workflow level)
    Full pipeline from trigger to completion
    Test cases: eval/cases.yaml
    Metrics: completion rate, Phase 4 avg rounds, gate trigger rate
Layer 2: Integration tests (Phase level)
    Cross-step data flow is correctly passed
    Cross-phase routing logic fires correctly
Layer 1: Unit tests (Step level)
    Each subagent's output matches its output contract
    No real LLM calls; validates JSON schema only
Test priority: Layer 1 tests should be the most numerous and the fastest, since they catch contract violations in seconds. Layer 3 is the slowest and most expensive, so run it only when a change affects the main pipeline.
Layer 1: Step-Level Unit Tests
Unit tests verify that subagent output files match the declared schema. No real LLM calls needed.
# tests/unit/test_phase3_output.py
import json
from pathlib import Path


def test_analysis_output_schema():
    """Phase 3 output must conform to analysis_final.json schema"""
    output = json.loads(Path("test_fixtures/phase3/analysis_final.json").read_text())

    assert "passed" in output
    assert isinstance(output["passed"], bool)

    assert "confidence" in output
    assert 0.0 <= output["confidence"] <= 1.0

    assert "root_cause" in output
    # root_cause may be a string or null; the tuple form also works before Python 3.10
    assert isinstance(output["root_cause"], (str, type(None)))

    assert "evidence" in output
    assert isinstance(output["evidence"], list)

    # on failure, error field must be present and non-empty
    if not output["passed"]:
        assert "error" in output
        assert output["error"]


def test_fix_candidate_output_schema():
    """Phase 4 candidate output schema"""
    for candidate in ["candidate_a", "candidate_b", "candidate_c"]:
        output_file = Path(f"test_fixtures/phase4/{candidate}.json")
        # candidates without a saved fixture are skipped
        if output_file.exists():
            output = json.loads(output_file.read_text())
            assert "passed" in output
            assert "test_coverage" in output
            assert isinstance(output["test_coverage"], float)
Test fixtures: save real run outputs as test data, with one successful path and one failure path per subagent. The fixtures document exactly what the contract looks like in practice.
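As the number of subagents grows, the per-field assertions above can be driven from a small contract table instead of being repeated in every test. The following is a minimal sketch under stated assumptions: `PHASE3_CONTRACT`, `check_contract`, and the two inline sample dicts are hypothetical names and data, and the rules simply mirror the Phase 3 assertions shown earlier.

```python
from typing import Any

# Hypothetical contract table: field name -> accepted Python types after json.loads
PHASE3_CONTRACT: dict[str, tuple[type, ...]] = {
    "passed": (bool,),
    "confidence": (int, float),
    "root_cause": (str, type(None)),
    "evidence": (list,),
}


def check_contract(output: dict[str, Any], contract: dict[str, tuple[type, ...]]) -> list[str]:
    """Return a list of violations; an empty list means the output conforms."""
    violations = []
    for field, types in contract.items():
        if field not in output:
            violations.append(f"missing field: {field}")
        elif not isinstance(output[field], types):
            violations.append(f"wrong type for {field}: {type(output[field]).__name__}")

    # Range rule from the Phase 3 test: confidence must lie in [0, 1]
    confidence = output.get("confidence")
    if isinstance(confidence, (int, float)) and not 0.0 <= confidence <= 1.0:
        violations.append("confidence out of range")

    # Failure rule: a failed step must carry a non-empty error field
    if output.get("passed") is False and not output.get("error"):
        violations.append("failed output without error")
    return violations


# Hypothetical stand-ins for a success fixture and a (deliberately broken) failure fixture
success_fixture = {"passed": True, "confidence": 0.97,
                   "root_cause": "NPE in parseInput", "evidence": ["stack trace"]}
failure_fixture = {"passed": False, "confidence": 0.4, "root_cause": None, "evidence": []}

print(check_contract(success_fixture, PHASE3_CONTRACT))
print(check_contract(failure_fixture, PHASE3_CONTRACT))
```

Returning a list of violations rather than asserting on the first one means a single test run reports every way a fixture drifts from the contract.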
Layer 2: Integration Tests
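Integration tests cover two problem types: data flow between phases and routing decisions.
Data flow tests: verify that Phase N's output can be consumed by Phase N+1. Routing tests: verify that each phase result is sent to the correct next step.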
# tests/integration/test_phase_data_flow.py
import json
from pathlib import Path

# hypothetical module path; import route_after_phase3 from wherever your routing logic lives
from workflow_router import route_after_phase3


def test_phase1_output_satisfies_phase2_context():
    """Phase 1's bug_info.json must include all fields declared in Phase 2's context_inputs"""
    bug_info = json.loads(Path("test_fixtures/phase1/bug_info.json").read_text())
    required_fields = ["summary", "stack_trace", "jira_key", "attachment_path"]
    for field in required_fields:
        assert field in bug_info, f"Phase 1 output missing field required by Phase 2: {field}"


def test_phase3_routing_logic():
    """Phase 3 completion triggers correct routing based on confidence"""
    # high confidence → proceed to Phase 4
    high_conf = {"passed": True, "confidence": 0.97, "root_cause": "NPE in parseInput"}
    assert route_after_phase3(high_conf) == "phase_4"

    # medium confidence → trigger Gate A
    mid_conf = {"passed": True, "confidence": 0.75, "root_cause": "..."}
    assert route_after_phase3(mid_conf) == "gate_A"

    # low confidence + retries remaining → retry Phase 3
    low_conf = {"passed": False, "confidence": 0.45}
    assert route_after_phase3(low_conf, retry_count=1) == "phase_3_retry"

    # low confidence + retries exhausted → human escalation
    assert route_after_phase3(low_conf, retry_count=3) == "human_escalation"
Routing logic implemented as a pure Python function runs all edge cases in milliseconds with no LLM calls.
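The routing test above calls `route_after_phase3` without showing it. A minimal sketch of what such a pure function might look like follows. The 0.95 threshold matches the confidence condition used in the end-to-end cases; the retry limit of 3 is an assumption inferred from the `retry_count=3` escalation assertion, and the real implementation may differ.

```python
HIGH_CONFIDENCE = 0.95  # matches the "confidence >= 0.95" condition in the eval cases
MAX_RETRIES = 3         # assumption: inferred from the retry_count=3 escalation test


def route_after_phase3(result: dict, retry_count: int = 0) -> str:
    """Pure routing decision: no I/O and no LLM calls, so the same input
    always yields the same next step."""
    if not result.get("passed", False):
        # Failed analysis: retry while attempts remain, then hand off to a human
        if retry_count < MAX_RETRIES:
            return "phase_3_retry"
        return "human_escalation"

    if result.get("confidence", 0.0) >= HIGH_CONFIDENCE:
        return "phase_4"

    # Passed, but not confident enough to continue unattended
    return "gate_A"


print(route_after_phase3({"passed": True, "confidence": 0.97}))                  # phase_4
print(route_after_phase3({"passed": True, "confidence": 0.75}))                  # gate_A
print(route_after_phase3({"passed": False, "confidence": 0.45}, retry_count=1))  # phase_3_retry
print(route_after_phase3({"passed": False, "confidence": 0.45}, retry_count=3))  # human_escalation
```

Keeping the thresholds as module-level constants makes boundary cases (for example a confidence of exactly 0.95) easy to pin down in a test.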
Layer 3: End-to-End Tests and Metric Baselines
Test Case Definitions
# eval/cases.yaml
cases:
  - id: WF-E2E-001
    name: Happy path (high confidence, first-attempt pass)
    input:
      jira_key: AE-MOCK-001
      bug_description: "NullPointerException in parseInput() when config=null"
    expected_flow:
      - phase_1: done
      - phase_2: done
      - phase_3: done (confidence >= 0.95)
      - phase_4: done (first candidate passes)
      - phase_5: done
      - phase_6: done
      - phase_7: done
    expected_metrics:
      e2e_success: true
      phase4_rounds: 1
      gates_triggered: []

  - id: WF-E2E-002
    name: Low confidence path (Gate A triggered)
    input:
      jira_key: AE-MOCK-002
      bug_description: "Intermittent crash, no reproducible steps"
    expected_flow:
      - phase_3: done (confidence < 0.95)
      - gate_A: triggered

  - id: WF-E2E-003
    name: Fix failure path (all candidates fail, Gate B triggered)
    input:
      jira_key: AE-MOCK-003
    expected_flow:
      - phase_4: all candidates failed
      - gate_B: triggered
    expected_metrics:
      phase4_rounds: 3
      gates_triggered: [gate_B]
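Each case's `expected_metrics` block can be checked mechanically against what a run actually produced. The sketch below assumes the eval runner (not shown in this article) yields one observed-metrics dict per case; `diff_expected_metrics` and the observed values are hypothetical.

```python
def diff_expected_metrics(expected: dict, observed: dict) -> dict:
    """Return {metric: (expected, observed)} for every expected metric that does not match."""
    mismatches = {}
    for name, want in expected.items():
        got = observed.get(name)
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


# expected_metrics of WF-E2E-003, written out as a Python dict
expected = {"phase4_rounds": 3, "gates_triggered": ["gate_B"]}

# Hypothetical observed result: the run stopped after two rounds instead of three
observed = {"e2e_success": False, "phase4_rounds": 2, "gates_triggered": ["gate_B"]}

print(diff_expected_metrics(expected, observed))  # {'phase4_rounds': (3, 2)}
```

Only metrics named in the case are compared, so a case such as WF-E2E-002, which declares no `expected_metrics`, passes this check trivially and relies on its `expected_flow` instead.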
Core Metric Definitions
End-to-end completion rate > 70%
    = fully automated completions / total triggers
Phase 4 average rounds < 1.5
    = mean phase4_rounds across all runs
    (close to 1: fix quality is good; close to 3: test pass rate is low)
Parallel candidate pass rate > 80%
    = fraction of workflows where at least 1 candidate passed
    (below 80%: root cause analysis quality or fix strategy needs work)
Gate trigger rate < 20%
    = fraction of workflows that triggered any human gate
    (above 20%: LLM quality or input data quality has a problem)
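The four definitions above translate directly into a small aggregation over run records. This is a sketch under assumptions: the record fields (`e2e_success`, `phase4_rounds`, `any_candidate_passed`, `gates_triggered`) and the sample runs are hypothetical, and only the thresholds come from the definitions.

```python
from statistics import mean

# Thresholds from the metric definitions: (comparison, limit)
THRESHOLDS = {
    "e2e_completion_rate": (">", 0.70),
    "phase4_avg_rounds": ("<", 1.5),
    "candidate_pass_rate": (">", 0.80),
    "gate_trigger_rate": ("<", 0.20),
}


def compute_metrics(runs: list[dict]) -> dict[str, float]:
    """Aggregate the four core metrics over a non-empty list of run records."""
    total = len(runs)
    return {
        "e2e_completion_rate": sum(r["e2e_success"] for r in runs) / total,
        "phase4_avg_rounds": mean(r["phase4_rounds"] for r in runs),
        "candidate_pass_rate": sum(r["any_candidate_passed"] for r in runs) / total,
        # a run counts once, no matter how many gates it triggered
        "gate_trigger_rate": sum(bool(r["gates_triggered"]) for r in runs) / total,
    }


def threshold_violations(metrics: dict[str, float]) -> list[str]:
    """Names of metrics that fail their threshold."""
    failed = []
    for name, (op, limit) in THRESHOLDS.items():
        ok = metrics[name] > limit if op == ">" else metrics[name] < limit
        if not ok:
            failed.append(name)
    return failed


# Hypothetical run records: three clean runs and one fix failure that hit Gate B
runs = [
    {"e2e_success": True, "phase4_rounds": 1, "any_candidate_passed": True, "gates_triggered": []},
    {"e2e_success": True, "phase4_rounds": 1, "any_candidate_passed": True, "gates_triggered": []},
    {"e2e_success": True, "phase4_rounds": 2, "any_candidate_passed": True, "gates_triggered": []},
    {"e2e_success": False, "phase4_rounds": 3, "any_candidate_passed": False, "gates_triggered": ["gate_B"]},
]

metrics = compute_metrics(runs)
print(metrics)
print(threshold_violations(metrics))
```

With only four runs a single failure moves every rate by 25 points, which is a reminder that the thresholds are only meaningful over a reasonably large set of runs.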
Regression Testing
Before modifying workflow.md / templates / policy.md, establish a baseline with historical cases:
# Step 1: run eval before changes, record baseline
python run_eval.py --cases eval/cases.yaml --output baseline_v1.3.json
# Step 2: make workflow changes
# ...
# Step 3: run the same cases again
python run_eval.py --cases eval/cases.yaml --output baseline_v1.4.json
# Step 4: compare delta
python compare_eval.py baseline_v1.3.json baseline_v1.4.json
# compare_eval.py output
Metric               v1.3    v1.4    Delta
───────────────────────────────────────────
e2e_success_rate     78%     82%     +4%    ✓
phase4_avg_rounds    1.6     1.4     -0.2   ✓
gate_trigger_rate    18%     22%     +4%    ⚠️ (above threshold)
gate_trigger_rate crossing 20% means this change makes certain paths more likely to trigger human review. Investigate before releasing.
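The comparison step can be sketched as a function that computes the delta per metric and flags any value that crosses its limit. `compare_eval.py` itself is not shown in this article, so the names `LIMITS`, `compare_baselines`, and `release_blocked`, as well as the dict layout of the baseline files, are assumptions; the numbers are the ones from the report above.

```python
# Release limits for the metrics in the report: "min" must stay above, "max" must stay below
LIMITS = {
    "e2e_success_rate": ("min", 0.70),
    "phase4_avg_rounds": ("max", 1.5),
    "gate_trigger_rate": ("max", 0.20),
}


def compare_baselines(before: dict, after: dict) -> list[dict]:
    """Build one report row per metric with its delta and a breach flag."""
    rows = []
    for name, (kind, limit) in LIMITS.items():
        value = after[name]
        breached = value <= limit if kind == "min" else value >= limit
        rows.append({
            "metric": name,
            "before": before[name],
            "after": value,
            "delta": round(value - before[name], 4),
            "breached": breached,
        })
    return rows


def release_blocked(rows: list[dict]) -> bool:
    """A single breached threshold is enough to block the release."""
    return any(row["breached"] for row in rows)


v13 = {"e2e_success_rate": 0.78, "phase4_avg_rounds": 1.6, "gate_trigger_rate": 0.18}
v14 = {"e2e_success_rate": 0.82, "phase4_avg_rounds": 1.4, "gate_trigger_rate": 0.22}

rows = compare_baselines(v13, v14)
for row in rows:
    print(row)
print("blocked:", release_blocked(rows))  # blocked: True
```

Note that the breach check looks at the new value, not the delta: a metric that improves but is still on the wrong side of its limit continues to block.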
Trace Tracking
Without Trace, every workflow run is a black box. When something goes wrong, the team digs through files, compares timestamps, and guesses execution order. With Langfuse, every run has a queryable chain — open the trace, find the phase, read the span.
Three-Layer Trace Structure
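# Written against the trace()/span()/event() client API of the Langfuse Python SDK v2;
# newer SDK versions may expose a different interface.
from langfuse import Langfuse

langfuse = Langfuse()


def run_workflow(jira_key: str) -> None:
    # Workflow-level trace (top layer)
    trace = langfuse.trace(
        name=f"wf-bug-e2e:{jira_key}",
        input={"jira_key": jira_key},
        metadata={"workflow_version": "1.3.0"},
    )

    # get_pending_phases, get_phase_context and execute_phase are workflow helpers defined elsewhere
    for phase_id in get_pending_phases():
        # Phase-level span
        span = trace.span(
            name=phase_id,
            input={"context": get_phase_context(phase_id)},
        )
        result = execute_phase(phase_id)
        span.end(
            output={"status": result["status"], "passed": result["passed"]},
            level="DEFAULT" if result["passed"] else "WARNING",
        )

        # Gate-level event; assumes the phase result carries a "gate_triggered" flag
        # and a "confidence" value (hypothetical field names)
        if result.get("gate_triggered"):
            trace.event(
                name="human_gate_A",
                metadata={"triggered_by": "low_confidence", "value": result.get("confidence")},
            )

    # send buffered events before a short-lived process exits
    langfuse.flush()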
What Trace Answers
How long did each phase take?
    → span start/end timestamps
Which phase consumed the most tokens?
    → span usage field
What was the raw error when a subagent failed?
    → span output.error field
Is Phase 3 confidence within a healthy range across runs?
    → span output.confidence, aggregated across multiple traces
No more guessing execution order or digging through files.
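The first and last questions come down to simple aggregation once span records are available as plain data. The sketch below does not call the Langfuse API; it assumes spans have already been exported into dicts with `name`, `start_time`, `end_time`, and `output` keys, which is a hypothetical shape, and the sample records are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def phase_durations(spans: list[dict]) -> dict[str, float]:
    """Mean duration in seconds per phase, from span start/end timestamps."""
    per_phase = defaultdict(list)
    for span in spans:
        start = datetime.fromisoformat(span["start_time"])
        end = datetime.fromisoformat(span["end_time"])
        per_phase[span["name"]].append((end - start).total_seconds())
    return {name: mean(values) for name, values in per_phase.items()}


def confidence_summary(spans: list[dict], phase: str = "phase_3") -> dict:
    """Min / mean / max confidence for one phase across all exported traces."""
    values = [
        s["output"]["confidence"]
        for s in spans
        if s["name"] == phase and "confidence" in s.get("output", {})
    ]
    if not values:
        return {"count": 0}
    return {"count": len(values), "min": min(values), "mean": mean(values), "max": max(values)}


# Hypothetical exported spans from two workflow runs
spans = [
    {"name": "phase_3", "start_time": "2024-01-01T10:00:00", "end_time": "2024-01-01T10:00:30",
     "output": {"confidence": 0.97}},
    {"name": "phase_3", "start_time": "2024-01-02T09:00:00", "end_time": "2024-01-02T09:00:50",
     "output": {"confidence": 0.75}},
    {"name": "phase_4", "start_time": "2024-01-01T10:00:30", "end_time": "2024-01-01T10:02:30",
     "output": {"passed": True}},
]

print(phase_durations(spans))
print(confidence_summary(spans))
```

Tracking the minimum alongside the mean is a deliberate choice here: a healthy average can hide a handful of runs that barely avoided Gate A.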
Design Checklist
Unit tests (Layer 1)
- [ ] Every subagent output has a schema validation test
- [ ] Fixtures cover both success and failure paths
- [ ] No real LLM calls — use saved real outputs as fixtures
Integration tests (Layer 2)
- [ ] Each phase's output fields align with the next phase's context_inputs
- [ ] All routing conditions (high/mid/low confidence, timeout, failure) have test coverage
- [ ] Routing logic is implemented as a pure function, runnable in milliseconds
End-to-end tests (Layer 3)
- [ ] eval/cases.yaml covers happy path, low-confidence path, fix-failure path
- [ ] 4 core metrics have defined thresholds
- [ ] Baseline delta comparison runs before every release; threshold violations block release
Trace tracking
- [ ] Every workflow run has a top-level trace
- [ ] Every phase has a span recording input, output, and latency
- [ ] Human gate triggers are recorded as events with reason metadata
Summary
- Three layers, three speeds: Layer 1 validates contracts with fixtures in seconds, Layer 2 tests data flow and routing in seconds, Layer 3 runs the full pipeline in minutes — the first two catch most problems before Layer 3 runs
- Metric baselines are release gates: if end-to-end completion rate, Phase 4 rounds, candidate pass rate, or gate trigger rate crosses a threshold, the change needs investigation
- Trace turns black boxes into queryable records: no more guessing execution order or digging through files — search the Langfuse trace for the run and read the span