From Single Files to Scenario Suites: Batch Validation in the OWASP Agent Security Regression Harness

#security #owasp #python #testing

Security regression testing is most valuable when it can cover your entire suite of scenarios in one command - not just a single file at a time. I recently contributed a change to the OWASP Agent Security Regression Harness that extends the validate command to accept files, directories, and glob patterns. The PR was merged with green CI. Here is what the problem was, what changed, and what it teaches about scoped open-source contributions.

What the Harness Does

The OWASP Agent Security Regression Harness is a framework for writing and running executable security regression tests for agentic applications and MCP-integrated systems. You define scenarios as YAML files describing security-relevant goals, expected behaviors, and assertions. The harness runs those scenarios against your agent and reports pass or fail results.

The validate command checks that a scenario file is structurally correct before you try to run it. Catching a malformed scenario early - before it silently passes CI - is exactly the kind of hygiene that prevents false confidence in a security regression suite.
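To make "structurally correct" concrete, here is a minimal sketch of a required-field check on an already-parsed scenario. It is an illustration only: the function name `find_structural_errors` and the field list in `REQUIRED_FIELDS` are assumptions made for this example, and the harness's real schema and validation code may look quite different.

```python
from typing import Any

# Hypothetical required keys, chosen for illustration only.
# The harness's actual schema may differ.
REQUIRED_FIELDS = ("id", "goal", "assertions")


def find_structural_errors(scenario: Any) -> list[str]:
    """Return human-readable problems; an empty list means no problems found."""
    if not isinstance(scenario, dict):
        return ["scenario must be a mapping"]
    return [
        f"field '{field}' is required"
        for field in REQUIRED_FIELDS
        if field not in scenario
    ]


# A draft scenario that is missing its assertions block.
draft = {"id": "secret_disclosure_draft", "goal": "leak the API key"}
errors = find_structural_errors(draft)
```

The point of the sketch is that a check like this runs before any scenario is executed, so a half-written file is reported as a problem instead of quietly contributing nothing to the suite.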

The Problem: One File at a Time

Before this change, validate accepted a single file path:

agent-harness validate scenarios/goal_hijack/basic.yaml

That works fine with one scenario. Real suites grow. A project might have dozens of scenario files across multiple subdirectories, covering prompt injection, goal hijacking, secret disclosure, and other attack patterns. Validating each file individually does not scale - and more importantly, a CI job that only validates one file gives you incomplete coverage.

The practical risk: an invalid scenario sitting in a directory that was never validated can produce misleading results. A scenario that fails to parse might be skipped silently, reducing the effective coverage of your regression suite without any visible signal.

What Changed

PR #150 extended validate to accept one or more files, directories, or glob patterns:

# Validate one file
agent-harness validate scenarios/goal_hijack/basic.yaml

# Validate every scenario in a directory (recursive)
agent-harness validate scenarios/

# Validate a glob pattern
agent-harness validate "scenarios/**/*.yaml"

The command prints one line per scenario followed by a summary, and exits non-zero if any scenario is invalid, which makes it usable as a CI gate.

valid: goal_hijack_basic
valid: prompt_injection_system
invalid: secret_disclosure_draft - field 'assertions' is required
---
2 valid, 1 invalid
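The reporting behavior can be sketched roughly as follows. This is a hedged illustration, not the harness's actual code: the `report` function and the `(name, error)` result shape are hypothetical names chosen for the example.

```python
from typing import Optional


def report(results: list[tuple[str, Optional[str]]]) -> int:
    """Print one line per scenario plus a summary; return an exit code.

    Each result is (scenario_name, error_message), where error_message
    is None for a valid scenario. This shape is an assumption.
    """
    invalid = 0
    for name, error in results:
        if error is None:
            print(f"valid: {name}")
        else:
            invalid += 1
            print(f"invalid: {name} - {error}")
    valid = len(results) - invalid
    print("---")
    print(f"{valid} valid, {invalid} invalid")
    # A non-zero status is what lets a CI job fail on a bad scenario.
    return 1 if invalid else 0


exit_code = report(
    [
        ("goal_hijack_basic", None),
        ("prompt_injection_system", None),
        ("secret_disclosure_draft", "field 'assertions' is required"),
    ]
)
```

In a real command-line entry point the returned value would typically be passed to `sys.exit`, so that a single invalid scenario is enough to fail the pipeline step.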

The core of the implementation is a _discover_scenario_files function that handles three input cases: a directory (searched recursively for .yaml and .yml files), a glob pattern, and a literal file path:

import glob
from pathlib import Path


def _discover_scenario_files(patterns: list[str]) -> list[Path]:
    """Return unique scenario files matched by files, directories, or globs."""
    scenario_files: list[Path] = []
    seen: set[Path] = set()

    for pattern in patterns:
        path = Path(pattern)
        if path.is_dir():
            # Directory: collect .yaml and .yml files recursively.
            matches = sorted(
                matched
                for suffix in ("*.yaml", "*.yml")
                for matched in path.rglob(suffix)
                if matched.is_file()
            )
        else:
            # Anything else is tried as a glob; "**" matches nested directories.
            glob_matches = sorted(
                Path(match) for match in glob.glob(pattern, recursive=True)
            )
            # No glob match: fall back to the literal path. The is_file()
            # check below drops it if it does not exist.
            matches = glob_matches if glob_matches else [path]

        for match in matches:
            # Deduplicate on the resolved path, but keep the path as given.
            normalized = match.resolve()
            if normalized in seen or not match.is_file():
                continue
            seen.add(normalized)
            scenario_files.append(match)

    return scenario_files

Deduplication via resolved paths prevents a scenario from being validated twice when a glob pattern and an explicit file path both match the same file.
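A small, self-contained sketch of why resolving matters: two different spellings of the same file compare as unequal Path objects until they are resolved. The temporary directory layout and the variable names here are made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scenarios").mkdir()
    (root / "scenarios" / "basic.yaml").write_text("id: basic\n")

    # Two different spellings of the same file on disk.
    direct = root / "scenarios" / "basic.yaml"
    roundabout = root / "scenarios" / ".." / "scenarios" / "basic.yaml"

    # Path comparison is purely textual, so these differ...
    spellings_differ = direct != roundabout
    # ...but resolving collapses ".." and symlinks to one canonical path.
    same_after_resolve = direct.resolve() == roundabout.resolve()

    # The same seen-set pattern used in the function above.
    seen: set[Path] = set()
    kept: list[Path] = []
    for candidate in (direct, roundabout):
        normalized = candidate.resolve()
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(candidate)
```

Only the first spelling survives in `kept`, which mirrors how the discovery function keeps the first path it encounters and preserves it exactly as the user wrote it.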

The nargs="+" change to the argument parser is the smallest part of the diff, but it is what makes the whole thing composable in CI:

- name: Validate all security scenarios
  run: agent-harness validate "scenarios/**/*.yaml"
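For readers unfamiliar with nargs="+", here is a minimal argparse sketch of a subcommand that takes one or more positional arguments. The parser layout, the `paths` destination name, and the help strings are assumptions for illustration; the harness's real CLI wiring may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy CLI with a validate subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="agent-harness")
    subparsers = parser.add_subparsers(dest="command", required=True)

    validate = subparsers.add_parser("validate", help="validate scenario files")
    # nargs="+" means "one or more": every positional is collected into a
    # list, and passing no paths at all is a usage error.
    validate.add_argument(
        "paths",
        nargs="+",
        metavar="PATH",
        help="scenario file, directory, or glob pattern",
    )
    return parser


args = build_parser().parse_args(["validate", "scenarios/", "extra/*.yaml"])
```

Because the result is always a list, a single file, a directory, and a quoted glob can be mixed freely in one invocation and handed to the discovery step unchanged.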

The Review

The maintainer accepted the change after one round of review. The scope was deliberate: no unrelated refactors, no new dependencies, tests for each input mode (single file, recursive directory, mixed-validity glob), and documentation in the GitHub Actions guide. Keeping a contribution scoped to what the issue actually asked for is what makes a PR quick to review.

CI ran lint (ruff), type checking (mypy), and the full test suite - 333 tests, 1 skipped, all scenario validation checks passing.

The Lesson

A validation command that only accepts one file at a time is not a CI-ready tool. The change is small - under 100 lines including tests - but it makes the harness meaningfully more useful for teams running scenario suites in automated pipelines.

If you work on security regression testing for AI systems or MCP-integrated applications, the OWASP Agent Security Regression Harness is worth looking at. The project has open issues for SARIF output, suite-level runners, and additional adapter support - all practical contributions with clear scope.

PR #150 on GitHub