What I Built
wfguard is a Go CLI that audits GitHub Actions workflows for supply-chain attack patterns. It combines two engines:
- A deterministic Go pass that catches what regex catches: pwn-request triggers, mutable-tag pins from unverified publishers, missing permissions: blocks, references to known-compromised actions. I seeded the tj-actions/changed-files 2025 incident in the known-bad list.
- A Gemma 4 agent loop that catches what regex can't: cross-step taint flow, action-source review, severity calls that depend on the workflow's trigger surface.
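To make the deterministic pass concrete, here is a minimal sketch of three line-oriented checks. It is not wfguard's actual code: the `Issue` type, the rule names, and the regular expressions are assumptions for illustration, and only the tj-actions/changed-files entry comes from the post.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Issue is a hypothetical finding type for this sketch.
type Issue struct {
	Rule string
	Line int // 1-based; 0 means the finding applies to the whole file
	Text string
}

// knownBad holds compromised actions. Only this entry comes from the post.
var knownBad = map[string]bool{"tj-actions/changed-files": true}

var (
	usesRe    = regexp.MustCompile(`^\s*-?\s*uses:\s*([^\s@#]+)@(\S+)`)
	triggerRe = regexp.MustCompile(`\bpull_request_target\b`)
	permsRe   = regexp.MustCompile(`(?m)^\s*permissions:`)
)

// scan applies three coarse checks to raw workflow text.
func scan(workflow string) []Issue {
	var out []Issue
	for i, line := range strings.Split(workflow, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "#") {
			continue // skip full-line comments
		}
		if triggerRe.MatchString(line) {
			out = append(out, Issue{"pwn-request-trigger", i + 1, trimmed})
		}
		if m := usesRe.FindStringSubmatch(line); m != nil && knownBad[m[1]] {
			out = append(out, Issue{"known-compromised-action", i + 1, m[1] + "@" + m[2]})
		}
	}
	if !permsRe.MatchString(workflow) {
		out = append(out, Issue{"missing-permissions", 0, "no permissions: block found"})
	}
	return out
}

func main() {
	wf := "on: pull_request_target\njobs:\n  build:\n    steps:\n      - uses: tj-actions/changed-files@v35\n"
	for _, is := range scan(wf) {
		fmt.Printf("%s (line %d): %s\n", is.Rule, is.Line, is.Text)
	}
}
```

Checks like these are cheap and predictable, which is why they suit the deterministic half. They also only see one line at a time, which is the gap the agent loop is meant to cover.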
The tool outputs a Markdown report, SARIF 2.1.0 for GitHub's code-scanning UI, and with --harden, a unified diff. You run wfguard scan ./repo --harden, then apply the patch with git apply report.patch. Gemma 4 produces the corrected file. The tool validates it parses as YAML before including it in the patch. wfguard puts only changed files in the diff.
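For readers who haven't emitted SARIF before, the sketch below builds a minimal SARIF 2.1.0 log for a single finding using only the standard library. The struct names, the severity-to-level mapping, and the rule ID are assumptions for illustration; wfguard's real output likely carries more fields (rule metadata, for example).

```go
package main

import (
	"encoding/json"
	"fmt"
)

type sarifLog struct {
	Schema  string     `json:"$schema"`
	Version string     `json:"version"`
	Runs    []sarifRun `json:"runs"`
}

type sarifRun struct {
	Tool    sarifTool     `json:"tool"`
	Results []sarifResult `json:"results"`
}

type sarifTool struct {
	Driver struct {
		Name string `json:"name"`
	} `json:"driver"`
}

type sarifResult struct {
	RuleID    string          `json:"ruleId"`
	Level     string          `json:"level"` // "error", "warning", or "note"
	Message   sarifText       `json:"message"`
	Locations []sarifLocation `json:"locations"`
}

type sarifText struct {
	Text string `json:"text"`
}

type sarifLocation struct {
	PhysicalLocation struct {
		ArtifactLocation struct {
			URI string `json:"uri"`
		} `json:"artifactLocation"`
		Region struct {
			StartLine int `json:"startLine"`
		} `json:"region"`
	} `json:"physicalLocation"`
}

// levelFor maps a severity name to a SARIF level (this mapping is an assumption).
func levelFor(severity string) string {
	switch severity {
	case "critical", "high":
		return "error"
	case "medium":
		return "warning"
	default:
		return "note"
	}
}

// buildSARIF wraps one finding in a minimal SARIF 2.1.0 log.
func buildSARIF(ruleID, severity, msg, file string, line int) ([]byte, error) {
	var loc sarifLocation
	loc.PhysicalLocation.ArtifactLocation.URI = file
	loc.PhysicalLocation.Region.StartLine = line

	var tool sarifTool
	tool.Driver.Name = "wfguard"

	doc := sarifLog{
		Schema:  "https://json.schemastore.org/sarif-2.1.0.json",
		Version: "2.1.0",
		Runs: []sarifRun{{
			Tool: tool,
			Results: []sarifResult{{
				RuleID:    ruleID,
				Level:     levelFor(severity),
				Message:   sarifText{Text: msg},
				Locations: []sarifLocation{loc},
			}},
		}},
	}
	return json.MarshalIndent(doc, "", "  ")
}

func main() {
	out, err := buildSARIF("ref-taint", "high",
		"tag name flows into a sed program", ".github/workflows/release.yml", 12)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```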
A real CI workflow in the wild had this snippet:
    - run: |
        export VERSION="${GITHUB_REF#refs/tags/v}"
        sed -i "s/version=.*/version=\"${VERSION}\",/" setup.py
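For a tag-triggered release, $GITHUB_REF is refs/tags/<tagname>. Git tag names accept most printable ASCII characters, including /, " and ; (spaces are among the few exclusions). Bash expands ${VERSION} before sed parses its program, so the tag text becomes part of the sed script itself: a tag containing / closes the replacement early, and whatever follows is read as sed flags or further commands. GNU sed's e flag and e command execute shell commands, so a crafted tag can reach code execution. My static rules didn't catch it: no ${{ ... }} interpolation to anchor on, just a runner env var threading through string interpolation into bash. Pattern-based rules can't see this. The agent can.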
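The mechanism is easier to see by rebuilding the string that sed actually receives. The sketch below mirrors the bash expansion in Go; `sedProgram` and `delimiters` are hypothetical helpers, not part of wfguard. A well-formed `s` command here has exactly three `/` delimiters, so a fourth means tag text has escaped the replacement and is being parsed as sed syntax.

```go
package main

import (
	"fmt"
	"strings"
)

// sedProgram rebuilds the program string sed receives after bash expansion.
func sedProgram(githubRef string) string {
	// Mirrors ${GITHUB_REF#refs/tags/v}: strip the prefix if it is present.
	version := strings.TrimPrefix(githubRef, "refs/tags/v")
	return `s/version=.*/version="` + version + `",/`
}

// delimiters counts the "/" separators sed will see in the program.
func delimiters(program string) int {
	return strings.Count(program, "/")
}

func main() {
	// The second ref is a valid tag name that contains a slash.
	for _, ref := range []string{"refs/tags/v1.2.3", "refs/tags/v1.2.3/e"} {
		p := sedProgram(ref)
		fmt.Printf("%-20s -> %s (%d delimiters)\n", ref, p, delimiters(p))
	}
}
```

The second program is not a working exploit, only evidence that the attacker controls the program's structure. Passing the version through an environment variable that a script reads as data, or validating the tag against a strict pattern first, would close this path.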
Three design choices:
- submit_finding is the agent's only output channel. The tool ignores anything the model says outside a tool call. Structured output without strict JSON mode. The model can't hallucinate findings outside the schema.
- Default --min-severity high. wfguard computes hygiene findings (unpinned actions/* tags, missing permissions: blocks) but hides them by default. Most workflow scanners drown users in these. The LLM agent uses them as context; the human sees them only with --min-severity low.
- UnpinnedRule is narrow. It fires for unverified publishers or actions with a known compromise history. actions/checkout@v4 is fine. random-vendor/some-tool@v1 is not. The OpenSSF "pin everything to a SHA" advice is correct in theory but produces ~80% noise on real repos.
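The second and third choices can be sketched in a few lines of Go. Everything named here is an assumption for illustration: the `Severity` scale, the `verifiedOwners` allowlist, and the decision to treat a full 40-character SHA as pinned are my guesses at one reasonable shape, not wfguard's actual rule.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type Severity int

const (
	Low Severity = iota
	Medium
	High
	Critical
)

// Finding is a hypothetical result type for this sketch.
type Finding struct {
	Rule     string
	Severity Severity
}

// visible returns the findings at or above the reporting threshold.
// The full list would still be handed to the agent as context.
func visible(all []Finding, threshold Severity) []Finding {
	var out []Finding
	for _, f := range all {
		if f.Severity >= threshold {
			out = append(out, f)
		}
	}
	return out
}

var (
	verifiedOwners = map[string]bool{"actions": true, "github": true} // illustrative allowlist
	compromised    = map[string]bool{"tj-actions/changed-files": true}
	fullSHA        = regexp.MustCompile(`^[0-9a-f]{40}$`)
)

// unpinnedFires reports whether a "uses:" reference should raise the narrow rule.
func unpinnedFires(ref string) bool {
	name, version, ok := strings.Cut(ref, "@")
	if !ok || fullSHA.MatchString(version) {
		return false // no version to judge, or already pinned to a commit
	}
	if compromised[name] {
		return true
	}
	owner, _, _ := strings.Cut(name, "/")
	return !verifiedOwners[owner]
}

func main() {
	for _, ref := range []string{"actions/checkout@v4", "random-vendor/some-tool@v1"} {
		fmt.Printf("%s fires=%v\n", ref, unpinnedFires(ref))
	}
	all := []Finding{{"missing-permissions", Low}, {"ref-taint", High}}
	fmt.Println("shown at high:", len(visible(all, High)), "of", len(all))
}
```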
Code
https://github.com/nshekhawat/wfguard
How I Used Gemma 4
Primary model: Gemma 4 31B Dense (gemma-4-31b-it).
I picked 31B for three model properties and one problem property:
- 256K context. Workflow YAML is small. Referenced action source code is large. Calling get_action_source('actions/checkout@v4') returns action.yml plus dist/index.js, which can be hundreds of KB of bundled JavaScript. A 32K model would be useless. An 8K model would force me to pre-summarize, which defeats the point of letting the model read.
- Strongest dense reasoning in the Gemma 4 family. Multi-hop taint analysis is where smaller models stop being useful. The $GITHUB_REF → sed finding requires reasoning across three steps: the tag arrives in an env var, bash interpolates it into a sed argument, the shell runs it. Each step is trivial in isolation. The chain is where 31B matters.
- Native function calling. wfguard's agent has seven tools: list_workflows, get_workflow, get_action_source, resolve_reference, lookup_advisories, trace_expression_flow, submit_finding. The model picks tools, my Go dispatcher executes them, results go back as function_response parts. Strict JSON-mode workarounds would have been more code to get right.
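A dispatcher along these lines can be quite small. The sketch below is a guess at the shape, not wfguard's code: the `Finding` fields, the `ToolCall` type, and the stubbed list_workflows handler are assumptions. It also shows how making submit_finding the only path into the findings list, with strict decoding, keeps out-of-schema output from ever becoming a finding.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Finding is a hypothetical schema for submit_finding arguments.
type Finding struct {
	Rule     string `json:"rule"`
	Severity string `json:"severity"`
	File     string `json:"file"`
	Detail   string `json:"detail"`
}

// ToolCall is what the model asks for: a tool name plus JSON arguments.
type ToolCall struct {
	Name string
	Args string
}

type handler func(args string) (any, error)

// Dispatcher routes tool calls and collects findings.
type Dispatcher struct {
	handlers map[string]handler
	Findings []Finding
}

func NewDispatcher() *Dispatcher {
	d := &Dispatcher{handlers: map[string]handler{}}
	// Stub standing in for the read-only tools.
	d.handlers["list_workflows"] = func(string) (any, error) {
		return []string{".github/workflows/release.yml"}, nil
	}
	d.handlers["submit_finding"] = d.submitFinding
	return d
}

// submitFinding is the only code path that appends to Findings.
func (d *Dispatcher) submitFinding(args string) (any, error) {
	dec := json.NewDecoder(strings.NewReader(args))
	dec.DisallowUnknownFields() // reject fields outside the schema
	var f Finding
	if err := dec.Decode(&f); err != nil {
		return nil, fmt.Errorf("submit_finding: %w", err)
	}
	if f.Rule == "" || f.Severity == "" {
		return nil, errors.New("submit_finding: rule and severity are required")
	}
	d.Findings = append(d.Findings, f)
	return map[string]string{"status": "recorded"}, nil
}

// Dispatch runs one tool call; the result or error goes back to the model.
func (d *Dispatcher) Dispatch(c ToolCall) (any, error) {
	h, ok := d.handlers[c.Name]
	if !ok {
		return nil, fmt.Errorf("unknown tool %q", c.Name)
	}
	return h(c.Args)
}

func main() {
	d := NewDispatcher()
	_, err := d.Dispatch(ToolCall{"submit_finding",
		`{"rule":"ref-taint","severity":"high","file":"release.yml","detail":"tag reaches sed"}`})
	fmt.Println("recorded:", len(d.Findings), "err:", err)
}
```

Returning errors to the model rather than aborting gives it a chance to retry with corrected arguments.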
Comparison I ran: Gemma 4 E4B (gemma-4-e4b-it-mlx) via LM Studio.
I built wfguard backend-neutral. A Generator interface has two implementations: one for the Gemini API, one for any OpenAI-compatible server (LM Studio, vLLM, llama.cpp, Ollama, Unsloth). Everything stays the same except the wire format. Switching models is one --backend flag.
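A backend-neutral seam of this kind might look like the sketch below. The method signature, the `Message` and `Turn` types, and the backend names "gemini" and "openai" are all assumptions; the post only says that a Generator interface exists with two implementations. The wire-format code is deliberately left out, so both stubs return an error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Message and Turn are hypothetical, backend-independent types.
type Message struct{ Role, Text string }

type Turn struct {
	Text      string
	ToolCalls []string
}

// Generator is the seam between the agent loop and a model backend.
type Generator interface {
	Generate(ctx context.Context, history []Message) (Turn, error)
}

var errNotWired = errors.New("wire format omitted in this sketch")

// geminiGenerator would translate history into Gemini API requests.
type geminiGenerator struct{ model string }

func (g geminiGenerator) Generate(ctx context.Context, history []Message) (Turn, error) {
	return Turn{}, errNotWired
}

// openAICompatGenerator would target any OpenAI-compatible server.
type openAICompatGenerator struct{ baseURL, model string }

func (g openAICompatGenerator) Generate(ctx context.Context, history []Message) (Turn, error) {
	return Turn{}, errNotWired
}

// newGenerator picks an implementation from a backend name (names are assumptions).
func newGenerator(backend, model, baseURL string) (Generator, error) {
	switch backend {
	case "gemini":
		return geminiGenerator{model: model}, nil
	case "openai":
		return openAICompatGenerator{baseURL: baseURL, model: model}, nil
	default:
		return nil, fmt.Errorf("unknown backend %q", backend)
	}
}

func main() {
	g, err := newGenerator("openai", "gemma-4-e4b-it-mlx", "http://localhost:1234/v1")
	if err != nil {
		panic(err)
	}
	_, err = g.Generate(context.Background(), []Message{{"user", "audit this workflow"}})
	fmt.Printf("%T: %v\n", g, err)
}
```

Keeping the agent loop written against the interface is what lets prompts, tools, and output handling stay identical across models.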
E4B works. It calls tools and produces valid SARIF. Three weaknesses:
- Decisiveness. E4B keeps calling tools past the point where it should stop. With --max-steps 5, it hits the limit. 31B returns a clean no-tool-call turn in 4-7 steps.
- Hardening fidelity. E4B's hardening output drops unrelated comments. The security fix is correct, but the user loses context. 31B keeps the comments.
- Cross-step reasoning. The $GITHUB_REF → sed finding came from 31B. E4B didn't surface it on the same workflow.
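The decisiveness difference comes down to how the loop terminates. Here is a minimal sketch of a step-budgeted loop, assuming the loop ends either on a turn with no tool calls or when the budget runs out. The `Model` interface and the `scripted` stand-in are hypothetical; a real loop would execute each tool call and feed results back.

```go
package main

import "fmt"

// Turn is one model response; an empty ToolCalls slice means the model is done.
type Turn struct{ ToolCalls []string }

// Model is a hypothetical stand-in for a real backend.
type Model interface{ Next(step int) Turn }

// scripted stops calling tools at step stopAt; stopAt <= 0 means it never stops.
type scripted struct{ stopAt int }

func (s scripted) Next(step int) Turn {
	if s.stopAt > 0 && step >= s.stopAt {
		return Turn{}
	}
	return Turn{ToolCalls: []string{"get_workflow"}}
}

// runAgent returns the number of steps used and whether the model finished cleanly.
func runAgent(m Model, maxSteps int) (steps int, finished bool) {
	for step := 1; step <= maxSteps; step++ {
		turn := m.Next(step)
		if len(turn.ToolCalls) == 0 {
			return step, true // clean no-tool-call turn
		}
		// A real loop would execute the calls here and append the results.
	}
	return maxSteps, false // budget exhausted
}

func main() {
	for _, m := range []scripted{{stopAt: 4}, {stopAt: 0}} {
		steps, ok := runAgent(m, 5)
		fmt.Printf("stopAt=%d steps=%d finished=%v\n", m.stopAt, steps, ok)
	}
}
```

Findings submitted before the budget runs out are still usable, so hitting the limit costs completeness rather than correctness.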
The trade-off works: E4B for a free local pass during development (single workflow in ~2 minutes on an M-series Mac, no API spend), 31B for production hardening.
E2B: not used. The smallest variant doesn't have the context for action-source reading and would force a redesign of the tool set.
26B-A4B MoE: not benchmarked here. Natural follow-up: with ~4B active params it should land between E4B and 31B on cost and quality. One --backend flag to compare.
The hardening pass uses Gemma 4 in a different mode. The audit loop is tool-calling. The hardener is codegen: "here's a workflow YAML and a list of confirmed findings; produce a corrected version, output only YAML, no fences, no prose." wfguard diffs the output against the original and emits a unified patch. 31B's code-generation strength matters most here. The model writes YAML that has to round-trip through yaml.Unmarshal and git apply without breaking. Most outputs do.
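The guard between model output and the patch can be sketched as a small acceptance check. This is an assumption about the shape, not wfguard's code: `checkHardened` and `noTabIndent` are hypothetical names, and the parser is injected as a function so the example stays in the standard library. The post names yaml.Unmarshal for that role; `noTabIndent` is only a stand-in that rejects tab indentation, which YAML forbids.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// checkHardened decides whether model output may enter the patch.
// It returns include=false with a nil error when nothing changed.
func checkHardened(original, candidate string, parses func(string) error) (include bool, err error) {
	fence := strings.Repeat("`", 3)
	if strings.Contains(candidate, fence) {
		return false, errors.New("output contains a code fence")
	}
	if strings.TrimSpace(candidate) == "" {
		return false, errors.New("output is empty")
	}
	if err := parses(candidate); err != nil {
		return false, fmt.Errorf("output does not parse: %w", err)
	}
	return candidate != original, nil // only changed files go into the diff
}

// noTabIndent is a stand-in validator, not a YAML parser.
func noTabIndent(doc string) error {
	for i, line := range strings.Split(doc, "\n") {
		indent := line[:len(line)-len(strings.TrimLeft(line, " \t"))]
		if strings.Contains(indent, "\t") {
			return fmt.Errorf("line %d: tab in indentation", i+1)
		}
	}
	return nil
}

func main() {
	original := "on: push\njobs: {}\n"
	hardened := "on: push\npermissions:\n  contents: read\njobs: {}\n"
	include, err := checkHardened(original, hardened, noTabIndent)
	fmt.Println("include:", include, "err:", err)
}
```

Rejecting rather than repairing bad output keeps the patch trustworthy: a workflow that fails the check is simply left out of the diff.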