DEV Community

Cover image for Workflow Series (05): Evaluation Framework — Three-Layer Testing and Trace Tracking
WonderLab
WonderLab

Posted on

Workflow Series (05): Evaluation Framework — Three-Layer Testing and Trace Tracking

Why Workflows Need a Dedicated Evaluation Framework

Traditional software testing covers code correctness. Workflows add two layers of uncertainty:

  • LLM output is non-deterministic: the same input can produce different results across runs
  • Cross-step dependencies: a Phase 3 problem may only surface at Phase 7, making the debugging chain long

Without an evaluation framework, every workflow change requires a full end-to-end run: slow, expensive, incomplete coverage. Three-layer testing decomposes the problem.


Three-Layer Evaluation Structure

Layer 3: End-to-end tests (Workflow level)
  Full pipeline from trigger to completion
  Test cases: eval/cases.yaml
  Metrics: completion rate, Phase 4 avg rounds, gate trigger rate

Layer 2: Integration tests (Phase level)
  Cross-step data flow is correctly passed
  Cross-phase routing logic fires correctly

Layer 1: Unit tests (Step level)
  Each subagent's output matches its output contract
  No real LLM calls — validates JSON schema only
Enter fullscreen mode Exit fullscreen mode

Test priority: Layer 1 should be the most numerous and fastest — catches contract violations in seconds. Layer 3 is the slowest and most expensive — run it only when changes affect the main pipeline.


Layer 1: Step-Level Unit Tests

Unit tests verify that subagent output files match the declared schema. No real LLM calls needed.

# tests/unit/test_phase3_output.py
import json
from pathlib import Path

def test_analysis_output_schema():
    """Phase 3 output must conform to analysis_final.json schema"""
    output = json.loads(Path("test_fixtures/phase3/analysis_final.json").read_text())

    assert "passed" in output
    assert isinstance(output["passed"], bool)
    assert "confidence" in output
    assert 0.0 <= output["confidence"] <= 1.0
    assert "root_cause" in output
    assert isinstance(output["root_cause"], str | type(None))
    assert "evidence" in output
    assert isinstance(output["evidence"], list)

    # on failure, error field must be present and non-empty
    if not output["passed"]:
        assert "error" in output
        assert output["error"]

def test_fix_candidate_output_schema():
    """Phase 4 candidate output schema"""
    for candidate in ["candidate_a", "candidate_b", "candidate_c"]:
        output_file = Path(f"test_fixtures/phase4/{candidate}.json")
        if output_file.exists():
            output = json.loads(output_file.read_text())
            assert "passed" in output
            assert "test_coverage" in output
            assert isinstance(output["test_coverage"], float)
Enter fullscreen mode Exit fullscreen mode

Test fixtures: save real run outputs as test data, with one successful path and one failure path per subagent. The fixtures document exactly what the contract looks like in practice.


Layer 2: Integration Tests

Integration tests cover two problem types:

Data flow tests: verify that Phase N's output can be consumed by Phase N+1.

# tests/integration/test_phase_data_flow.py

def test_phase1_output_satisfies_phase2_context():
    """Phase 1's bug_info.json must include all fields declared in Phase 2's context_inputs"""
    bug_info = json.loads(Path("test_fixtures/phase1/bug_info.json").read_text())

    required_fields = ["summary", "stack_trace", "jira_key", "attachment_path"]
    for field in required_fields:
        assert field in bug_info, f"Phase 1 output missing field required by Phase 2: {field}"

def test_phase3_routing_logic():
    """Phase 3 completion triggers correct routing based on confidence"""
    # high confidence → proceed to Phase 4
    high_conf = {"passed": True, "confidence": 0.97, "root_cause": "NPE in parseInput"}
    assert route_after_phase3(high_conf) == "phase_4"

    # medium confidence → trigger Gate A
    mid_conf = {"passed": True, "confidence": 0.75, "root_cause": "..."}
    assert route_after_phase3(mid_conf) == "gate_A"

    # low confidence + retries remaining → retry Phase 3
    low_conf = {"passed": False, "confidence": 0.45}
    assert route_after_phase3(low_conf, retry_count=1) == "phase_3_retry"

    # low confidence + retries exhausted → human escalation
    assert route_after_phase3(low_conf, retry_count=3) == "human_escalation"
Enter fullscreen mode Exit fullscreen mode

Routing logic implemented as a pure Python function runs all edge cases in milliseconds with no LLM calls.


Layer 3: End-to-End Tests and Metric Baselines

Test Case Definitions

# eval/cases.yaml
cases:
  - id: WF-E2E-001
    name: Happy path (high confidence, first-attempt pass)
    input:
      jira_key: AE-MOCK-001
      bug_description: "NullPointerException in parseInput() when config=null"
    expected_flow:
      - phase_1: done
      - phase_2: done
      - phase_3: done (confidence >= 0.95)
      - phase_4: done (first candidate passes)
      - phase_5: done
      - phase_6: done
      - phase_7: done
    expected_metrics:
      e2e_success: true
      phase4_rounds: 1
      gates_triggered: []

  - id: WF-E2E-002
    name: Low confidence path (Gate A triggered)
    input:
      jira_key: AE-MOCK-002
      bug_description: "Intermittent crash, no reproducible steps"
    expected_flow:
      - phase_3: done (confidence < 0.95)
      - gate_A: triggered

  - id: WF-E2E-003
    name: Fix failure path (all candidates fail, Gate B triggered)
    input:
      jira_key: AE-MOCK-003
    expected_flow:
      - phase_4: all candidates failed
      - gate_B: triggered
    expected_metrics:
      phase4_rounds: 3
      gates_triggered: [gate_B]
Enter fullscreen mode Exit fullscreen mode

Core Metric Definitions

End-to-end completion rate   > 70%
  = fully automated completions / total triggers

Phase 4 average rounds       < 1.5
  = mean phase4_rounds across all runs
  (close to 1: fix quality is good; close to 3: test pass rate is low)

Parallel candidate pass rate > 80%
  = fraction of workflows where at least 1 candidate passed
  (below 80%: root cause analysis quality or fix strategy needs work)

Gate trigger rate            < 20%
  = fraction of workflows that triggered any human gate
  (above 20%: LLM quality or input data quality has a problem)
Enter fullscreen mode Exit fullscreen mode

Regression Testing

Before modifying workflow.md / templates / policy.md, establish a baseline with historical cases:

# Step 1: run eval before changes, record baseline
python run_eval.py --cases eval/cases.yaml --output baseline_v1.3.json

# Step 2: make workflow changes
# ...

# Step 3: run the same cases again
python run_eval.py --cases eval/cases.yaml --output baseline_v1.4.json

# Step 4: compare delta
python compare_eval.py baseline_v1.3.json baseline_v1.4.json
Enter fullscreen mode Exit fullscreen mode
# compare_eval.py output
Metric               v1.3    v1.4    Delta
───────────────────────────────────────────
e2e_success_rate     78%     82%     +4%  ✓
phase4_avg_rounds    1.6     1.4     -0.2 ✓
gate_trigger_rate    18%     22%     +4%  ⚠️ (above threshold)
Enter fullscreen mode Exit fullscreen mode

gate_trigger_rate crossing 20% means this change makes certain paths more likely to trigger human review. Investigate before releasing.


Trace Tracking

Without Trace, every workflow run is a black box. When something goes wrong, the team digs through files, compares timestamps, and guesses execution order. With Langfuse, every run has a queryable chain — open the trace, find the phase, read the span.

Three-Layer Trace Structure

from langfuse import Langfuse

langfuse = Langfuse()

def run_workflow(jira_key: str) -> None:
    # Workflow-level trace (top layer)
    trace = langfuse.trace(
        name=f"wf-bug-e2e:{jira_key}",
        input={"jira_key": jira_key},
        metadata={"workflow_version": "1.3.0"}
    )

    for phase_id in get_pending_phases():
        # Phase-level span
        span = trace.span(
            name=phase_id,
            input={"context": get_phase_context(phase_id)}
        )

        result = execute_phase(phase_id)

        span.end(
            output={"status": result["status"], "passed": result["passed"]},
            level="DEFAULT" if result["passed"] else "WARNING"
        )

    if gate_triggered:
        trace.event(
            name="human_gate_A",
            metadata={"triggered_by": "low_confidence", "value": confidence}
        )
Enter fullscreen mode Exit fullscreen mode

What Trace Answers

How long did each phase take?
  → span start/end timestamps

Which phase consumed the most tokens?
  → span usage field

What was the raw error when a subagent failed?
  → span output.error field

Is Phase 3 confidence within a healthy range across runs?
  → span output.confidence, aggregated across multiple traces
Enter fullscreen mode Exit fullscreen mode

No more guessing execution order or digging through files.


Design Checklist

Unit tests (Layer 1)

  • [ ] Every subagent output has a schema validation test
  • [ ] Fixtures cover both success and failure paths
  • [ ] No real LLM calls — use saved real outputs as fixtures

Integration tests (Layer 2)

  • [ ] Each phase's output fields align with the next phase's context_inputs
  • [ ] All routing conditions (high/mid/low confidence, timeout, failure) have test coverage
  • [ ] Routing logic is implemented as a pure function, runnable in milliseconds

End-to-end tests (Layer 3)

  • [ ] eval/cases.yaml covers happy path, low-confidence path, fix-failure path
  • [ ] 4 core metrics have defined thresholds
  • [ ] Baseline delta comparison runs before every release; threshold violations block release

Trace tracking

  • [ ] Every workflow run has a top-level trace
  • [ ] Every phase has a span recording input, output, and latency
  • [ ] Human gate triggers are recorded as events with reason metadata

Summary

  1. Three layers, three speeds: Layer 1 validates contracts with fixtures in seconds, Layer 2 tests data flow and routing in seconds, Layer 3 runs the full pipeline in minutes — the first two catch most problems before Layer 3 runs
  2. Metric baselines are release gates: if end-to-end completion rate, Phase 4 rounds, candidate pass rate, or gate trigger rate crosses a threshold, the change needs investigation
  3. Trace turns black boxes into queryable records: no more guessing execution order or digging through files — search the Langfuse trace for the run and read the span

Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.

Find more useful knowledge and interesting products on my Homepage

Top comments (0)