WonderLab

Workflow Series (07): Engineering and Version Management — CI/CD for Workflows

Why Workflows Need CI

When code changes, CI runs tests and catches problems. A Workflow's "code" is Markdown + YAML. What catches problems there?

Three failures that typically reach runtime undetected:

  • Add an output field in templates/analyze.md, forget to declare it in workflow.md's context_inputs. The downstream phase receives nothing and silently continues.
  • Change the routing confidence threshold from 0.95 to 0.9, skip updating the routing tests. Edge case behavior shifts; you find out when a workflow runs in production.
  • Delete a template file with an active reference in workflow.md.

All three are catchable at commit time with automated checks, not at runtime.
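The second failure is the kind a boundary test pins down. A minimal sketch, assuming a hypothetical `route_after_analysis` function and assuming the rule is "proceed to Phase 4 when confidence is at or above the threshold, otherwise trigger Gate A" (the real condition lives in the workflow files and may differ):

```python
# Hypothetical routing rule, used only to illustrate boundary testing.
CONFIDENCE_THRESHOLD = 0.90  # assumed value; would normally be read from config.yaml


def route_after_analysis(confidence: float, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the next target: phase_4 when confidence meets the threshold, else gate_A."""
    return "phase_4" if confidence >= threshold else "gate_A"


def test_threshold_boundaries():
    # Just below, exactly at, and just above the threshold.
    assert route_after_analysis(0.89) == "gate_A"
    assert route_after_analysis(0.90) == "phase_4"
    assert route_after_analysis(0.91) == "phase_4"
```

If someone changes the threshold without touching this test, the assertion at the old boundary fails at commit time rather than in production.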


Three CI Gates

Gate 1: Static validation (seconds, runs on every commit)
  - All referenced template files exist
  - Skills in config.yaml exist in the registry
  - Every phase's on_success / on_failure target is a known phase or reserved keyword

Gate 2: Schema tests (minutes, runs on every commit)
  - context_inputs declarations align with actual upstream output fields
  - No real LLM calls — validates data contracts only
  - Corresponds to Layer 1 + Layer 2 tests from the Evaluation article (W5)

Gate 3: End-to-end regression (hours, runs before merge)
  - Run eval/cases.yaml happy path through the full workflow
  - Compare results against baseline metrics
  - Corresponds to Layer 3 tests from W5

Gate 1: Static Validation Script

Gate 1 makes no LLM calls. It consists purely of filesystem and text checks, so it completes in seconds:

#!/usr/bin/env python3
# tools/validate_workflow.py

import sys
import re
import yaml
from pathlib import Path

SKILL_DIR = Path("skills/wf-bug-e2e")
TEMPLATES_DIR = SKILL_DIR / "templates"
ERRORS = []


def check_template_references():
    """All templates referenced in workflow.md must exist on disk"""
    content = (SKILL_DIR / "workflow.md").read_text()
    refs = re.findall(r"template:\s*(\S+\.md)", content)

    for ref in refs:
        if not (TEMPLATES_DIR / ref).exists():
            ERRORS.append(f"Template not found: templates/{ref} (referenced in workflow.md)")


def check_phase_routing():
    """Every on_success / on_failure target must be a known phase or reserved keyword"""
    content = (SKILL_DIR / "workflow.md").read_text()
    phases = set(re.findall(r"^phase_(\w+):", content, re.MULTILINE))
    targets = re.findall(r"(?:on_success|on_failure|continue_to):\s*(\S+)", content)

    reserved = {"END", "human_escalation", "gate_A", "gate_B", "gate_C"}
    for target in targets:
        phase_name = target.removeprefix("phase_")  # strip only a leading "phase_" (Python 3.9+)
        if target not in reserved and phase_name not in phases:
            ERRORS.append(f"Routing target not found: '{target}'")


def check_config_skills():
    """Skills referenced in config.yaml must exist in the registry"""
    config_file = SKILL_DIR / "config.yaml"
    registry_file = Path("skills/registry.yaml")
    if not config_file.exists() or not registry_file.exists():
        return

    config = yaml.safe_load(config_file.read_text())
    registry = yaml.safe_load(registry_file.read_text())
    registered_ids = {s["id"] for s in registry.get("skills", [])}

    for phase_config in config.get("phases", {}).values():
        skill_id = phase_config.get("skill")
        if skill_id and skill_id not in registered_ids:
            ERRORS.append(f"Skill not in registry: '{skill_id}' (check config.yaml)")


def main():
    check_template_references()
    check_phase_routing()
    check_config_skills()

    if ERRORS:
        print("❌ Workflow validation failed:")
        for e in ERRORS:
            print(f"  - {e}")
        sys.exit(1)

    print("✅ Workflow validation passed")


if __name__ == "__main__":
    main()
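The checks are easier to test when they take text instead of reading files. A sketch of the routing check as a pure function, under the assumption that workflow.md declares phases as unindented `phase_<name>:` lines (the `SAMPLE` layout below is hypothetical, not the real file format):

```python
import re

RESERVED = {"END", "human_escalation", "gate_A", "gate_B", "gate_C"}


def find_unknown_targets(content: str) -> list[str]:
    """Return routing targets that are neither a declared phase nor a reserved keyword."""
    phases = set(re.findall(r"^phase_(\w+):", content, re.MULTILINE))
    targets = re.findall(r"(?:on_success|on_failure|continue_to):\s*(\S+)", content)
    return [
        t for t in targets
        if t not in RESERVED and t.removeprefix("phase_") not in phases
    ]


# Hypothetical workflow fragment: phase_9 is referenced but never declared.
SAMPLE = """\
phase_1:
  on_success: phase_2
  on_failure: human_escalation
phase_2:
  on_success: END
  on_failure: phase_9
"""

print(find_unknown_targets(SAMPLE))  # ['phase_9']
```

With this shape, the script's `check_phase_routing` could read the file once and delegate to the pure function, and a unit test can cover the regexes without touching disk.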

Wired into CI (GitHub Actions):

# .github/workflows/workflow-ci.yml
name: Workflow CI

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyyaml
      - name: Gate 1 — Static validation
        run: python tools/validate_workflow.py

  schema-tests:
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4
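      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # The schema tests import yaml, so PyYAML is needed here as well
      - run: pip install pytest pyyaml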
      - name: Gate 2 — Schema tests
        run: pytest tests/unit/ tests/integration/ -v

Gate 2: Data Contract Verification

Gate 2 verifies that every Phase's declared context_inputs aligns with the actual output fields from upstream phases.

# tests/integration/test_context_alignment.py

import yaml, json, re
from pathlib import Path

def load_context_inputs(phase_id: str) -> list[str]:
    config = yaml.safe_load(Path("skills/wf-bug-e2e/config.yaml").read_text())
    return config["phases"][phase_id].get("context_inputs", [])

def load_output_fields(phase_id: str) -> set[str]:
    template = Path(f"skills/wf-bug-e2e/templates/{phase_id}.md").read_text()
    # Find the fenced json block in the template; `{3} matches the three backticks of a fence
    schema_match = re.search(r"`{3}json\n(\{.*?\})\n`{3}", template, re.DOTALL)
    if schema_match:
        return set(json.loads(schema_match.group(1)).keys())
    return set()

def test_phase3_context_alignment():
    phase3_inputs = load_context_inputs("phase_3")
    phase1_outputs = load_output_fields("phase_1")

    for input_decl in phase3_inputs:
        if input_decl.startswith("phases.phase1."):
            field = input_decl.replace("phases.phase1.", "")
            assert field in phase1_outputs, \
                f"Phase 3 needs '{field}' but Phase 1 output schema doesn't include it"
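The test above covers one pair of phases. A sketch of how the same check might generalize to every upstream phase, assuming all declarations follow the `phases.<phase>.<field>` shape seen in the test (the field names in the usage example are hypothetical):

```python
from collections import defaultdict


def group_inputs_by_upstream(declarations: list[str]) -> dict[str, set[str]]:
    """Group 'phases.<phase>.<field>' declarations by upstream phase name."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for decl in declarations:
        parts = decl.split(".", 2)
        if len(parts) == 3 and parts[0] == "phases":
            grouped[parts[1]].add(parts[2])
    return dict(grouped)


def missing_fields(declarations: list[str], outputs: dict[str, set[str]]) -> list[str]:
    """List every declared input whose upstream phase does not emit that field."""
    missing = []
    for phase, fields in group_inputs_by_upstream(declarations).items():
        for field in sorted(fields - outputs.get(phase, set())):
            missing.append(f"phases.{phase}.{field}")
    return missing


# Hypothetical declarations and output schemas
decls = ["phases.phase1.root_cause", "phases.phase1.severity", "phases.phase2.patch"]
outputs = {"phase1": {"root_cause"}, "phase2": {"patch"}}
print(missing_fields(decls, outputs))  # ['phases.phase1.severity']
```

A single parametrized test over all phases can then assert that `missing_fields` returns an empty list.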

Version Number Rules

Workflow files are code. Every change deserves a version.

MAJOR.MINOR.PATCH

MAJOR: Phase structure changes
  - Adding or removing a Phase
  - Major routing logic changes (affects main pipeline conditions)
  - Breaking changes to a subagent output schema
  → Risk of breaking in-progress workflow runs
  → Resume protocol must check version compatibility (see W3)

MINOR: Additive changes, backward compatible
  - Adding a Step inside an existing Phase
  - Adding gate options
  - Template improvements (no field changes)
  → In-progress runs complete with the old version
  → New triggers use the new version

PATCH: Wording and configuration adjustments
  - Prompt wording improvements
  - Timeout adjustments
  - Comment changes
  → Safe update; old state files resume without issue
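These rules can be made mechanical. A minimal sketch, with hypothetical helper names, that classifies the bump between two versions so a release script can decide which gates to run:

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError on malformed input."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_type(old: str, new: str) -> str:
    """Name the highest-order component that changed between two versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "MAJOR"
    if new_v[1] != old_v[1]:
        return "MINOR"
    return "PATCH"


print(bump_type("1.2.1", "1.3.0"))  # MINOR
print(bump_type("1.3.0", "2.0.0"))  # MAJOR
```

Comparing tuples of integers avoids the string-comparison trap where "1.10.0" sorts before "1.9.0".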

Where version numbers live:

# SKILL.md (workflow entry file)
---
name: wf-bug-e2e
version: 1.3.0      # update before each release
last_updated: 2026-06-01
---

Each run also records its version in the state file:

// workflow_state.json (bound at run time, verified on resume)
{
  "workflow_version": "1.3.0",
  ...
}
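A sketch of what "verified on resume" might look like. The function names are hypothetical, and the policy shown (resume only when the MAJOR component is unchanged) is an assumption derived from the version rules above, not the actual resume protocol from W3:

```python
import json
import re


def read_skill_version(skill_md: str) -> str:
    """Extract the version field from SKILL.md front matter text."""
    match = re.search(r"^version:\s*(\d+\.\d+\.\d+)", skill_md, re.MULTILINE)
    if match is None:
        raise ValueError("SKILL.md has no version field")
    return match.group(1)


def can_resume(state_json: str, skill_md: str) -> bool:
    """Assumed policy: a saved run may resume only if the MAJOR version is unchanged."""
    state_version = json.loads(state_json)["workflow_version"]
    current_version = read_skill_version(skill_md)
    return state_version.split(".")[0] == current_version.split(".")[0]


skill = "---\nname: wf-bug-e2e\nversion: 1.3.0\n---\n"
print(can_resume('{"workflow_version": "1.2.1"}', skill))  # True
print(can_resume('{"workflow_version": "0.9.0"}', skill))  # False
```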

Release Process

Step 1: Document the reason
  Write in CHANGELOG.md: why this change? what changed?
  Not "optimized some logic" — write "changed Phase 3 confidence threshold
  from 0.95 to 0.90, because historical data showed Gate A triggering at
  18%, above the < 20% target"

Step 2: Run Gate 1 + Gate 2
  python tools/validate_workflow.py
  pytest tests/unit/ tests/integration/

Step 3: (MAJOR only) Run Gate 3
  python run_eval.py --cases eval/cases.yaml --output baseline_new.json
  python compare_eval.py baseline_current.json baseline_new.json

Step 4: Update version number
  Edit SKILL.md version field
  Add new version entry to CHANGELOG.md

Step 5: Release
  Merge changes; old version enters deprecated status
  Document any in-progress workflow runs using the old version
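The baseline comparison in Step 3 could be as simple as the sketch below. It is not the actual compare_eval.py: the metric names and the 0.02 tolerance are hypothetical, and it assumes every metric is "higher is better".

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return a message for each metric that dropped by more than the tolerance."""
    problems = []
    for name, old_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None:
            problems.append(f"{name}: missing from new results")
        elif old_value - new_value > tolerance:
            problems.append(f"{name}: {old_value:.2f} -> {new_value:.2f}")
    return problems


# Hypothetical metrics from two evaluation runs
baseline = {"success_rate": 0.92, "gate_accuracy": 0.88}
candidate = {"success_rate": 0.93, "gate_accuracy": 0.80}
print(find_regressions(baseline, candidate))  # ['gate_accuracy: 0.88 -> 0.80']
```

A release script would exit non-zero when the returned list is non-empty, which is what lets a threshold violation block the release.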

CHANGELOG Template

# CHANGELOG

## v1.3.0 (2026-06-01)

### Changed
- Phase 3 confidence threshold: 0.95 → 0.90
  - Reason: historical Gate A trigger rate reached 18%, above the <20% target
  - Impact: ~5% of cases now proceed to Phase 4 instead of triggering Gate A

### Added
- Phase 4 collect-all strategy declared explicitly
  - Previous behavior was implicit; now documented as collect-all

## v1.2.1 (2026-05-15)

### Fixed
- Phase 7 Jira comment idempotency detection
  - Problem: inconsistent run_id format caused duplicate comments in some cases
  - Fix: standardized run_id format to "wf-{jira_key}-{date}"
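Because the template uses a predictable heading shape, Gate 1 could also confirm that the newest CHANGELOG entry matches the SKILL.md version. A sketch with hypothetical function names, assuming headings always look like `## v1.3.0 (2026-06-01)` and the newest entry comes first:

```python
import re


def changelog_versions(changelog: str) -> list[str]:
    """Collect versions from headings shaped like '## v1.3.0 (2026-06-01)'."""
    return re.findall(r"^## v(\d+\.\d+\.\d+)\b", changelog, re.MULTILINE)


def check_changelog(changelog: str, skill_version: str) -> list[str]:
    """Return error messages; an empty list means the changelog matches SKILL.md."""
    versions = changelog_versions(changelog)
    if not versions:
        return ["CHANGELOG.md has no version entries"]
    if versions[0] != skill_version:
        return [f"Newest CHANGELOG entry is v{versions[0]}, but SKILL.md says {skill_version}"]
    return []


sample = "# CHANGELOG\n\n## v1.3.0 (2026-06-01)\n\n### Changed\n\n## v1.2.1 (2026-05-15)\n"
print(changelog_versions(sample))        # ['1.3.0', '1.2.1']
print(check_changelog(sample, "1.3.0"))  # []
```

This catches the case where someone bumps the version in SKILL.md without writing down why.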

Design Checklist

File structure

  • [ ] Policy / Workflow / TaskSpec / Tool four-layer separation
  • [ ] config.yaml centralizes mutable parameters (timeouts, retry counts, model selection)
  • [ ] SKILL.md includes a version field

Gate 1 (static validation)

  • [ ] All template references exist on disk
  • [ ] All routing targets point to known phases or reserved keywords
  • [ ] Runs automatically in CI on every commit

Gate 2 (schema tests)

  • [ ] Tests verify that context_inputs align with upstream Phase output fields
  • [ ] All routing condition edge cases have test coverage
  • [ ] Runs automatically in CI on every commit

Gate 3 (end-to-end regression)

  • [ ] Required for MAJOR version changes
  • [ ] Results compared against baseline; threshold violations block release

Version management

  • [ ] Every release updates SKILL.md version number
  • [ ] CHANGELOG documents the reason for changes, not just what changed

Summary

  1. Three gates, three speeds: static validation in seconds catches file reference errors, schema tests in minutes catch contract misalignments, end-to-end regression in hours catches behavior regressions — the first two handle most errors at low cost
  2. Version numbers distinguish behavior changes from safe updates: MAJOR means routing or schema changed, so in-progress runs need explicit handling; PATCH means only wording or configuration changed, so old state files resume without issue
  3. CHANGELOG documents reasons, not actions: "changed threshold from 0.95 to 0.9" is an action; "Gate A was triggering 18% of the time, above the 20% target" is the reason — six months later you only need the reason
