wfguard: a GitHub Actions supply-chain auditor

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

What I Built

wfguard is a Go CLI that audits GitHub Actions workflows for supply-chain attack patterns. It combines two engines:

A deterministic Go pass that catches what regex catches: pwn-request triggers, mutable-tag pins from unverified publishers, missing permissions: blocks, references to known-compromised actions. I seeded the tj-actions/changed-files 2025 incident in the known-bad list.
A Gemma 4 agent loop that catches what regex can't: cross-step taint flow, action-source review, severity calls that depend on the workflow's trigger surface.
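The deterministic side is the kind of check that fits in a few lines of Go. Here is a minimal sketch of a pwn-request rule, assuming a plain regex scan over the workflow text. The function name `isPwnRequest` and both patterns are illustrative and not taken from the wfguard repo.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; not wfguard's actual rules.
var (
	// The workflow runs on pull_request_target, either as a mapping key
	// or inside an inline "on:" value.
	triggerRe = regexp.MustCompile(`(?m)^\s*pull_request_target\s*:|^on:.*\bpull_request_target\b`)
	// A checkout step asks for the pull request's head commit or branch.
	headRefRe = regexp.MustCompile(`ref:\s*\$\{\{\s*github\.event\.pull_request\.head\.(sha|ref)\s*\}\}`)
)

// isPwnRequest reports whether a workflow both runs with base-repo
// privileges and checks out code from the pull request head.
func isPwnRequest(workflow string) bool {
	return triggerRe.MatchString(workflow) && headRefRe.MatchString(workflow)
}

func main() {
	wf := `on:
  pull_request_target:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
`
	fmt.Println(isPwnRequest(wf))
}
```

A text scan like this is cheap and predictable, and it also shows the limit: it can only match shapes someone wrote down in advance.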

The tool outputs a Markdown report, SARIF 2.1.0 for GitHub's code-scanning UI, and, with --harden, a unified diff. You run wfguard scan ./repo --harden, then apply the patch with git apply report.patch. For each affected workflow, Gemma 4 produces the corrected file. wfguard checks that it parses as YAML before including it, and only files that actually changed end up in the diff.

A real CI workflow in the wild had this snippet:

- run: |
    export VERSION="${GITHUB_REF#refs/tags/v}"
    sed -i "s/version=.*/version=\"${VERSION}\",/" setup.py

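$GITHUB_REF for a release trigger is refs/tags/<tagname>. Git tag names accept most printable ASCII characters: spaces, ~, ^, :, ?, *, [ and \ are rejected, but /, ", ; and & are all legal. The shell does not re-parse the expanded value, so the double quotes hold. sed does parse it, though: a / in the tag closes the replacement early, and everything after it is read as sed flags and commands. With GNU sed that includes the w flag (write to a file) and the e command (run a shell command), so whoever can push a tag controls far more than the version string. My static rules didn't catch it: no ${{ ... }} interpolation to anchor on, just a runner env var threading through string interpolation into a sed script. Pattern-matching rules like mine don't see this. The agent did.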

Three design choices:

submit_finding is the agent's only output channel. The tool ignores anything the model says outside a tool call. That gives structured output without strict JSON mode: the model can't report a finding that doesn't fit the schema.
Default --min-severity high. wfguard computes hygiene findings (unpinned actions/* tags, missing permissions: blocks) but hides them by default. Most workflow scanners drown users in these. The LLM agent uses them as context; the human sees them only with --min-severity low.
UnpinnedRule is narrow. It fires for unverified publishers or actions with a known compromise history. actions/checkout@v4 is fine. random-vendor/some-tool@v1 is not. The OpenSSF "pin everything to a SHA" advice is correct in theory but produces ~80% noise on real repos.

Code

https://github.com/nshekhawat/wfguard

How I Used Gemma 4

Primary model: Gemma 4 31B Dense (gemma-4-31b-it).

I picked 31B for three reasons:

256K context. Workflow YAML is small. Referenced action source code is large. Calling get_action_source('actions/checkout@v4') returns action.yml plus dist/index.js, which can be hundreds of KB of bundled JavaScript. An 8K model would be useless. A 32K model would force me to pre-summarize, which defeats the point of letting the model read.
Strongest dense reasoning in the Gemma 4 family. Multi-hop taint analysis is where smaller models stop being useful. The $GITHUB_REF → sed finding requires reasoning across three steps: the tag arrives in an env var, bash interpolates it into a sed expression, and sed interprets the result. Each step is trivial in isolation. The chain is where 31B matters.
Native function calling. wfguard's agent has seven tools: list_workflows, get_workflow, get_action_source, resolve_reference, lookup_advisories, trace_expression_flow, submit_finding. The model picks tools, my Go dispatcher executes them, results go back as function_response parts. Strict JSON-mode workarounds would have been more code to get right.

Comparison I ran: Gemma 4 E4B (gemma-4-e4b-it-mlx) via LM Studio.

I built wfguard backend-neutral. A Generator interface has two implementations: one for the Gemini API, one for any OpenAI-compatible server (LM Studio, vLLM, llama.cpp, Ollama, Unsloth). Everything stays the same except the wire format. Switching models is one --backend flag.
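The backend split might look something like this. Only the `Generator` name comes from the post; the method signature, the message types, the flag values "gemini" and "openai", and the local base URL are assumptions, and both implementations are stubs that skip the actual HTTP calls.

```go
package main

import (
	"context"
	"fmt"
)

// Message and Reply are hypothetical, wire-format-neutral types.
type Message struct {
	Role string
	Text string
}

type Reply struct {
	Text      string
	ToolCalls []string
}

// Generator is the seam between the agent loop and a model backend.
type Generator interface {
	Generate(ctx context.Context, history []Message) (Reply, error)
}

type geminiBackend struct{ model string }

type openAIBackend struct{ baseURL, model string }

// Both methods are stubs: a real one would translate history into the
// backend's request format and parse tool calls out of the response.
func (g geminiBackend) Generate(ctx context.Context, history []Message) (Reply, error) {
	return Reply{}, fmt.Errorf("gemini backend (%s) not wired up in this sketch", g.model)
}

func (o openAIBackend) Generate(ctx context.Context, history []Message) (Reply, error) {
	return Reply{}, fmt.Errorf("openai-compatible backend at %s not wired up in this sketch", o.baseURL)
}

// newGenerator maps a --backend value to an implementation.
func newGenerator(backend, model string) (Generator, error) {
	switch backend {
	case "gemini":
		return geminiBackend{model: model}, nil
	case "openai":
		// Assumed local server address; adjust for vLLM, llama.cpp, etc.
		return openAIBackend{baseURL: "http://localhost:1234/v1", model: model}, nil
	default:
		return nil, fmt.Errorf("unknown backend %q", backend)
	}
}

func main() {
	g, err := newGenerator("openai", "gemma-4-e4b-it-mlx")
	fmt.Printf("%T %v\n", g, err)
}
```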

E4B works. It calls tools and produces valid SARIF. Three weaknesses:

Decisiveness. E4B keeps calling tools past the point where it should stop. With --max-steps 5, it hits the limit. 31B returns a clean no-tool-call turn in 4-7 steps.
Hardening fidelity. E4B's hardening output drops unrelated comments. The security fix is correct, but the user loses context. 31B keeps the comments.
Cross-step reasoning. The $GITHUB_REF → sed finding came from 31B. E4B didn't surface it on the same workflow.
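The decisiveness gap is easiest to see in the loop's stop condition. In this sketch the loop ends either when the model returns a turn with no tool calls or when the step budget runs out. `Model`, `scripted`, and `chatty` are stand-ins invented for the example; no real model is called.

```go
package main

import "fmt"

// Turn is one model response; an empty ToolCalls slice means "I'm done".
type Turn struct{ ToolCalls []string }

// Model is a hypothetical stand-in for a real backend.
type Model interface{ Next(step int) Turn }

// scripted replays a fixed list of turns, then stops calling tools.
type scripted []Turn

func (s scripted) Next(step int) Turn {
	if step < len(s) {
		return s[step]
	}
	return Turn{}
}

// chatty never stops calling tools, like a model that can't decide.
type chatty struct{}

func (chatty) Next(int) Turn { return Turn{ToolCalls: []string{"get_workflow"}} }

// runLoop returns how many model turns were used and whether the model
// finished on its own rather than hitting the budget.
func runLoop(m Model, maxSteps int) (steps int, finished bool) {
	for steps = 0; steps < maxSteps; steps++ {
		t := m.Next(steps)
		if len(t.ToolCalls) == 0 {
			return steps + 1, true
		}
		// A real loop would dispatch t.ToolCalls here and append the
		// results to the conversation before asking the model again.
	}
	return maxSteps, false
}

func main() {
	decisive := scripted{
		{ToolCalls: []string{"list_workflows"}},
		{ToolCalls: []string{"get_workflow"}},
		{ToolCalls: []string{"submit_finding"}},
	}
	fmt.Println(runLoop(decisive, 10))
	fmt.Println(runLoop(chatty{}, 5))
}
```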

The trade-off works: E4B for a free local pass during development (single workflow in ~2 minutes on an M-series Mac, no API spend), 31B for production hardening.

E2B: not used. The smallest variant doesn't have the context for action-source reading and would force a redesign of the tool set.

26B-A4B MoE: not benchmarked here. Natural follow-up: with ~4B active params it should land between E4B and 31B on cost and quality. One --backend flag to compare.

The hardening pass uses Gemma 4 in a different mode. The audit loop is tool-calling. The hardener is codegen: "here's a workflow YAML and a list of confirmed findings; produce a corrected version, output only YAML, no fences, no prose." wfguard diffs the output against the original and emits a unified patch. 31B's code-generation strength matters most here. The model writes YAML that has to round-trip through yaml.Unmarshal and git apply without breaking. Most outputs do.
