Jo Moore

Posted on Apr 27 • Originally published at github.com

How we self-pentested ciguard — Cycle 1: four findings, four advisories, two days

#security #devsecops #opensource #python

4 findings. 4 GHSAs. 4 CVEs requested. Same-day disclosure. v0.8.2 ships with the fixes. v0.8.3 wires the four PoCs in as permanent CI regression gates so the bugs cannot silently return. Total elapsed: ~48 hours. Total cost: $0.30 in cloud spend.

ciguard, briefly

ciguard is a static security auditor for CI/CD pipelines — GitLab CI, GitHub Actions, and Jenkins, plus cross-platform SCA. It ships as pip install ciguard, a multi-arch Docker image, and an MCP server you can plug into Claude Desktop, Claude Code, or Cursor. 44 deterministic rules across three platforms, 17 built-in policies, four output formats including SARIF 2.1.0 with native baseline diffing.

It went public on PyPI on 2026-04-25. The next day, it pentested itself.

This is the writeup of that engagement.

Why self-pentest a security tool?

Two reasons, neither of them tactical:

One: the credibility cost of finding a bug in a security tool is multiplicative. Users assume a security tool is itself secure. The moment a CVE lands against ciguard, the question every potential adopter asks is "if they couldn't keep their own code clean, why would I trust their findings about mine?" The way you avoid that is by surfacing the bugs yourself, in public, with a methodology that holds up to scrutiny.

Two: a self-pentest is cheapest before a project has traction. ciguard had roughly zero installs when v0.8.1 shipped, so even a Critical finding would have exposed no users. The cost was my time plus about $0.30 in cloud spend (an ephemeral droplet, destroyed at cycle close). Compare that to the cost once we have real users in real CI pipelines.

So: a recurring 6-monthly cadence, starting now, scoped to the surfaces ciguard actually exposes. CREST-aligned methodology because if I ever bring in an external reviewer, the report shouldn't get dismissed as "ad-hoc."

The methodology

PTES structure (Phases 0–7) for the engagement timeline. OWASP Testing Guide v4.2 for the execution checklist. CREST framing for the report so the format reads as a professional engagement, not a self-cert.

The lab: ephemeral DigitalOcean droplet provisioned by Terraform, Ubuntu 24.04, full toolchain (atheris, ZAP, semgrep, Trivy, ffuf, gobuster, nmap) installed via cloud-init in 3-5 minutes. make up CYCLE=1 to bring it up; make destroy && make nuke at close-out. Total cycle cost: $0.30 of a $50 cap. The droplet is destroyed at cycle close; Cycle 2 reprovisions from scratch with whatever the latest tool versions are then.

Why cloud-ephemeral instead of Kali-in-UTM (the original plan)? Three reasons: my Mac doesn't have 60 GB free for a VM; "attacker tools never run on the target machine" is honoured by cloud separation; and reprovisioning Cycle 2 in October from the same Terraform doesn't depend on a snapshot that's drifted six months.

Scope: six of ciguard's eight surfaces — CLI, parsers, FastAPI Web UI, reporters, baseline+delta, MCP server. Out of scope: the GitHub App and hosted SaaS, neither of which has shipped yet — they get assessed when they do. Supply chain is out of scope because it's already covered by Trivy + OIDC publishing; social engineering and physical access are out of scope by definition for a one-person open-source project.

What we found

Four findings — one Medium, three Low. No Critical or High.

Finding	Severity	CVSS v4.0	GHSA
CYCLE-1-001 — `discover_pipeline_files` follows symlinks out of scan root	Medium	5.7	GHSA-8cxw-cc62-q28v
CYCLE-1-002 — Container image runs as root	Low	3.4	GHSA-jrm4-4pcf-4763
CYCLE-1-003 — SCA HTTP client reads response body unbounded	Low	3.1	GHSA-xw8c-rrvx-f7xq
CYCLE-1-004 — Web UI missing HTTP defence-in-depth headers	Low (Medium hosted)	4.3	GHSA-7ww3-xvf5-cxwm

The interesting one is CYCLE-1-001. ciguard ships an MCP server. One of the tools it exposes is scan_repo, which walks a directory and audits any pipeline files it finds. The threat scenario writes itself: an AI agent gets fed an adversarial prompt — "Scan /tmp/cloned-suspicious-repo for pipeline issues" — and the cloned-untrusted-source repo contains symlinks pointing at ~/.aws/, ~/.ssh/, or /etc/some-secret-pipeline/. Discovery walks the symlinks, returns the symlink-target paths and their contents to the AI, which faithfully reports the "findings" back. Pipeline files often contain hardcoded secrets, internal hostnames, deploy keys.

Confused-deputy via MCP. Realistic in 2026 in a way it wasn't a year ago. Mitigated in v0.8.2 with follow_symlinks=False as the new default for the discovery walker, plus a belt-and-braces filter that drops any result whose .resolve() lies outside the scan root.
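A minimal sketch of that two-layer mitigation, assuming a walker built on os.walk. The function name discover, the PIPELINE_NAMES set, and the return type are hypothetical stand-ins, not ciguard's actual code:

```python
import os
from pathlib import Path

# Hypothetical set of file names treated as pipeline files in this sketch.
PIPELINE_NAMES = {".gitlab-ci.yml", "Jenkinsfile"}


def discover(scan_root: str, follow_symlinks: bool = False) -> list[Path]:
    """Return pipeline files under scan_root whose real path stays inside it."""
    root = Path(scan_root).resolve()
    found = []
    # Layer 1: followlinks=False stops os.walk descending into symlinked dirs.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        for name in filenames:
            if name not in PIPELINE_NAMES:
                continue
            candidate = Path(dirpath) / name
            # Layer 2: os.walk still lists symlinked *files*, so drop any
            # result whose resolved location lies outside the scan root.
            if candidate.resolve().is_relative_to(root):
                found.append(candidate)
    return sorted(found)
```

The second check matters even with the new default: a symlink to a single file sits in the directory listing like any other entry, so only resolving it reveals that it points elsewhere.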

The other three — container as root, unbounded HTTP read, and missing defence-in-depth headers — are bread-and-butter findings that any CREST-style engagement would surface. Useful catches that strengthen the posture; nothing exotic.
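For the unbounded-read finding, the general shape of a fix is a capped, chunked read. This is a sketch under my own assumptions: the 1 MB cap, the read_capped name, and the BodyTooLarge exception are illustrative and not taken from the ciguard patch.

```python
import io
from typing import BinaryIO

MAX_BODY_BYTES = 1_000_000  # assumed cap, chosen only for this sketch


class BodyTooLarge(Exception):
    """Raised when a response body exceeds the configured cap."""


def read_capped(stream: BinaryIO, limit: int = MAX_BODY_BYTES,
                chunk_size: int = 64 * 1024) -> bytes:
    """Read a file-like response body, refusing anything above limit bytes."""
    buf = bytearray()
    while True:
        # Request at most one byte past the limit so memory stays bounded.
        chunk = stream.read(min(chunk_size, limit + 1 - len(buf)))
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > limit:
            raise BodyTooLarge(f"response body exceeds {limit} bytes")
```

Any object with a read(n) method works here, so the same helper can wrap an HTTP response stream or, for testing, an io.BytesIO.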

What didn't find anything (also worth saying)

ciguard had passed Bandit + pip-audit + Trivy gates on every commit since v0.1.4 (2026-04-25). That accumulated discipline showed up in what the cycle didn't find:

Atheris coverage-guided fuzzing — 220k iterations across all three parsers (GitLabCIParser, GitHubActionsParser, hand-rolled JenkinsfileParser). Zero crashes, zero hangs, memory ceiling 65 MB. The Jenkinsfile parser was the highest-risk surface per threat model (no upstream parser to inherit hardening from); it survived clean.
Stored XSS via Jinja template injection: the HTML reporter renders findings into a template. The test used crafted pipeline content carrying <script>alert(1)</script> payloads. Auto-escape is on, so the payload renders as the inert text &lt;script&gt;alert(1)&lt;/script&gt; inside a <pre> block. Defence-in-depth working.
YAML deserialisation — every loader is yaml.SafeLoader, no exceptions. pickle and eval and exec of user input — zero occurrences in src/.
MCP gate (CIGUARD_MCP_DISABLED) truthy parsing — 25-case battery covering canonical truthy, mixed case, whitespace, falsy, empty, substring confusion (yesno, true_thing), numeric coincidence (01, 00, 2), literal escapes. All correctly classified.
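To illustrate the kind of parsing that battery exercises, here is a sketch of an exact-match truthy check. The accepted token set and the function names are my assumptions for illustration; the post does not list which values ciguard treats as truthy, nor how it classifies inputs such as 01.

```python
import os
from typing import Mapping, Optional

# Assumed canonical truthy tokens; the real accepted set may differ.
TRUTHY = frozenset({"1", "true", "yes", "on"})


def is_truthy(raw: Optional[str]) -> bool:
    """Exact-match check after trimming whitespace and folding case."""
    if raw is None:
        return False
    # Whole-token comparison avoids substring confusion ("yesno") and
    # numeric coincidence ("01") that prefix or int() parsing would allow.
    return raw.strip().casefold() in TRUTHY


def mcp_disabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when the gate variable holds a truthy token."""
    return is_truthy(environ.get("CIGUARD_MCP_DISABLED"))
```

Passing the environment in as a mapping keeps the check testable without touching the real process environment.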

Documenting what passed matters because it forms the regression watchlist for future cycles. These are the things to re-check first when a refactor lands.

The disclosure decision

Standard responsible-disclosure convention: a 14-day window between fix-ship and public advisory. v0.8.2 shipped 2026-04-27, so under that convention the GHSAs would have gone public on 2026-05-11.

I published all four same-day instead.

The 14-day window protects affected users — it gives them time to upgrade between the patch landing and the vulnerability becoming public. The standard policy assumes there is a user base whose safety would be compromised by faster disclosure.

ciguard had ~zero downloads at v0.8.1 ship. No affected users → no safety upside to waiting → standard policy adapts to publish promptly. v0.8.2 has been the default pip install ciguard since 2026-04-27; anyone landing on the project from this point forward gets the fixed version.

The credibility upside of immediate disclosure was the deciding factor: a public set of GHSAs (with CVEs requested, propagating to NVD over the following days) is the strongest possible artefact of "we self-pentested and shipped the fixes the same day." Withholding them for two weeks would have delayed exactly the thing that makes a self-pentest valuable as a credibility signal.

The decision is documented in the Cycle 1 final report's Definition of Done section. By Cycle 2 (October 2026), ciguard may have actual users, and the standard 14-day window applies again. This is a judgement call per cycle, not a permanent policy.

Closing the loop — making sure these don't come back

Cycle 1's final-report recommendations included two CI hooks that just shipped in v0.8.3:

Recommendation #2: wire the four PoC scripts in as CI regression gates. Each of the four PoCs has a binary outcome encoded in its exit code: 0 = EXPLOIT_CONFIRMED, 1 = EXPLOIT_FAILED. v0.8.3 adds tests/regression/cycle1/ holding the four scripts as live regression copies, plus a new regression-cycle1 job in the reusable _checks.yml workflow that runs all four on every push, every PR, and every release tag. The job inverts each script's exit code, so the build fails only when a regression appears (that is, when an exploit works again). The container PoC builds the image locally first so the gate fires before publish, not after.
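The exit-code inversion can be sketched as a small wrapper. The function names are hypothetical, and treating any exit code other than 0 or 1 as a failure is my own assumption, a stricter choice than the text describes, so that a broken PoC script cannot pass silently:

```python
import subprocess
import sys


def gate_exit_code(poc_exit_code: int) -> int:
    """Map a PoC exit code to the exit code the CI job should return.

    PoC convention from the post: 0 = EXPLOIT_CONFIRMED, 1 = EXPLOIT_FAILED.
    """
    if poc_exit_code == 0:
        return 1  # exploit works again: regression, fail the build
    if poc_exit_code == 1:
        return 0  # exploit blocked: the fix still holds
    return 2  # assumption: unexpected code means the PoC itself broke


def run_gate(command: list[str]) -> int:
    """Run one PoC command and return the inverted exit code."""
    completed = subprocess.run(command, check=False)
    return gate_exit_code(completed.returncode)
```

A CI step would then call something like sys.exit(run_gate([sys.executable, "poc.py"])), where poc.py stands in for one of the four scripts.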

The unit tests added in v0.8.2 cover the fixes as code paths. The PoC regression gates cover the exploit chains end-to-end. A future refactor that breaks the security guarantee in a different layer can pass the unit tests and still fail the PoC gate, and that is the property we want.

Recommendation #3: schedule a weekly atheris fuzz cron. v0.8.3 adds .github/workflows/atheris-fuzz.yml, which runs 1M iterations of coverage-guided fuzzing across all three parsers every Sunday at 06:00 UTC, with a per-input timeout of 10s and a total budget of 30 minutes. A crash uploads the offending input as a 30-day artifact and opens a security/fuzz-finding issue. A manual workflow_dispatch accepts a custom iteration count for spot runs.

Cycle 1 ran 220k iterations in ~2 minutes and surfaced no crashes; weekly 1M is cheap insurance against regressions when new rules or parser refactors land.

What's next

Cycle 2 is scheduled for 2026-10-28 → 2026-11-11 (six-monthly cadence). The Terraform lab gets reprovisioned from scratch with then-current tool versions; the report template clones from Cycle 1's. Same Definition of Done, applied to whatever surfaces have shipped by then (likely the v0.9.0 GitHub App + ciguard scan-repo CLI).

The four GHSAs are public as of today. CVE numbers from MITRE are working their way through GitHub's CNA pipeline and should attach to the advisories over the next few days; once they propagate, the GHSAs auto-decorate with CVE-2026-NNNN IDs.

If you're maintaining a security tool — or any open-source tool with a non-trivial attack surface — and you've been waiting for the "right moment" to do a self-pentest: do it before you have users. The cost is your time and a few dollars. The upside is a defensible posture that doesn't depend on nobody bothering to look.
