Verivus OSS Releases

Posted on Apr 5

How We Used AI Agents to Security-Audit an Open Source Project

#typescript #ai #security #opensource

Using sqry's code graph, parallel audit agents, and iterative Codex review to contribute security improvements to gstack.

Garry Tan open-sourced gstack on March 11, 2026. It is a CLI toolkit for Claude Code with a headless browser, Chrome extension, skill system, and telemetry layer. The project attracted 30+ PR authors within its first few weeks.

We wanted to contribute something useful. Security review seemed like the right fit. A headless browser that spawns subprocesses and handles cookies has a large attack surface, and security work tends to fall to the bottom of every fast-moving project's priority list.

If you haven't read our earlier posts: sqry is an AST-based code search tool. It parses code like a compiler, building a graph of functions, classes, imports, and call relationships across 35+ languages. llm-cli-gateway orchestrates multiple LLMs (Claude, Codex, Gemini) through a single MCP interface. The Codex review gate is our practice of requiring unconditional Codex approval before shipping.

The Codebase

At the time of our audit (late March 2026, against the main branch as of March 30), gstack had about 47,000 symbols across 212 files in TypeScript, JavaScript, HTML, CSS, Shell, Ruby, JSON, and SQL. The browse subsystem's handleWriteCommand function was roughly 715 lines with a complexity score of 58. The Chrome extension injects into every page the user visits. The sidebar agent spawns Claude subprocesses from a JSONL queue file.

Running grep "exec" on this codebase returns 60+ matches. None of them look obviously wrong. Security review requires understanding relationships between functions, not just finding keywords.

Why grep Falls Short

In a previous article, I described why structural code search matters for this kind of work.

Say you want to find every path from user input to a dangerous sink like Bun.spawn(). grep finds the spawn calls. It does not tell you which functions call those functions, which HTTP endpoints call those functions, or whether any validation sits between the endpoint and the spawn.
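To make the question concrete, here is a minimal sketch of the traversal a call graph enables. It is not sqry's API: the `CallGraph` type, the `findPaths` helper, and every node name except `Bun.spawn` are assumptions made up for illustration.

```typescript
// Hypothetical call graph: caller -> callees. Node names are illustrative,
// not gstack's real symbols.
type CallGraph = Map<string, string[]>;

// Depth-first enumeration of every acyclic call path from `entry` to `sink`.
function findPaths(graph: CallGraph, entry: string, sink: string): string[][] {
  const paths: string[][] = [];
  const walk = (node: string, trail: string[]): void => {
    if (trail.includes(node)) return; // already on this path: skip the cycle
    const next = [...trail, node];
    if (node === sink) {
      paths.push(next);
      return;
    }
    for (const callee of graph.get(node) ?? []) walk(callee, next);
  };
  walk(entry, []);
  return paths;
}

// Heuristic: a path counts as guarded if any function on it calls the validator.
function isGuarded(graph: CallGraph, path: string[], validator: string): boolean {
  return path.some((fn) => (graph.get(fn) ?? []).includes(validator));
}

const graph: CallGraph = new Map<string, string[]>([
  ["POST /command", ["handleCommand"]],
  ["GET /debug", ["runTool"]],
  ["handleCommand", ["validateArgs", "runTool"]],
  ["validateArgs", []],
  ["runTool", ["Bun.spawn"]],
]);

const commandPaths = findPaths(graph, "POST /command", "Bun.spawn");
const debugPaths = findPaths(graph, "GET /debug", "Bun.spawn");
const unguarded = [...commandPaths, ...debugPaths].filter(
  (p) => !isGuarded(graph, p, "validateArgs"),
);
```

In this toy graph only the `GET /debug` path reaches the sink without touching the validator. The guard check is a heuristic: a real review still has to confirm that the validation runs before the spawn and covers the right arguments.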

sqry made this practical. For gstack, with all 36 language plugins enabled (including high-cost plugins like JSON and ServiceNow XML), it indexed 212 files into a graph of 46,837 nodes and 39,083 canonical edges (55,365 raw) in 280ms.

sqry index . --force --include-high-cost

Files indexed:  212
Symbols:        46,837
Edges:          39,083 canonical (55,365 raw)
Plugins:        36 active
Build time:     280ms

Round 1: 10 Findings, 3 LLMs

Our first audit in March used three LLMs in separate roles. Claude and Codex each independently found overlapping but non-identical sets of issues. Gemini then verified all findings against source code. The total was 10 unique security findings across gstack's browse server, Chrome extension, design CLI, and telemetry layer. We submitted PR #664 with fixes and filed 10 public security issues (#665-#670, #672-#675). We disclosed publicly because gstack is a developer tool running locally, not a production service handling user data — the risk profile favors transparency over coordinated disclosure.

What gave us confidence these were real: three other contributors (stedfn, Gonzih, and mehmoodosman) independently found at least 6 of the same issues through separate analysis. Based on the public timeline, their PRs were filed after our issues and showed no references to our reports, suggesting independent discovery. Convergence from different methods and different people is strong validation.

Round 2: 20 More Findings

For the second audit, we expanded the approach. We dispatched 4 parallel audit agents instead of manually querying sqry:

Agent 1: server.ts, covering HTTP endpoints, auth, and CORS
Agent 2: write-commands.ts, the highest-complexity function, covering file ops and cookie handling
Agent 3: meta-commands.ts, covering command parsing, state management, and frame targeting
Agent 4: extension/, covering the Chrome extension sidepanel, inspector, and background worker

Each agent had full sqry MCP access with instructions to look for issues beyond the 10 we had already reported. They returned 25 raw findings. After cross-referencing against 20+ existing community issues and the maintainer's own security work (he had already landed two security-focused PRs), 16 were new. Four more gaps turned up during implementation review. The severity classifications below are ours, based on our assessment of impact and prerequisites — the maintainer may classify them differently.

A Subtle but Serious Finding

# bin/gstack-learnings-search, lines 46-52
cat "${FILES[@]}" 2>/dev/null | bun -e "
const type = '${TYPE}';
const query = '${QUERY}'.toLowerCase();
const limit = ${LIMIT};
const slug = '${SLUG}';

Bash variables are interpolated directly into JavaScript string literals via bun -e. A branch name containing a single quote, like fix'; process.exit(1); //, would break out of the JS string and execute arbitrary code. Easy to write, hard to spot in review.

The fix: pass parameters via environment variables instead of string interpolation.

cat "${FILES[@]}" 2>/dev/null | \
  GSTACK_FILTER_TYPE="$TYPE" \
  GSTACK_FILTER_QUERY="$QUERY" \
  GSTACK_FILTER_LIMIT="$LIMIT" \
  bun -e "
const type = process.env.GSTACK_FILTER_TYPE || '';
const query = (process.env.GSTACK_FILTER_QUERY || '').toLowerCase();
const limit = parseInt(process.env.GSTACK_FILTER_LIMIT || '10', 10) || 10;

Environment variables are never interpreted as code. The injection vector disappears.
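The difference between the two patterns can be reproduced without a shell. The sketch below is a simulation under stated assumptions: `new Function` stands in for `bun -e`, and `marker`, `runInterpolated`, and `runWithEnv` are hypothetical names used only to observe whether injected code ran.

```typescript
// Simulation only: `new Function` stands in for `bun -e`, and `marker` is a
// hypothetical object used to observe whether injected code executed.
const marker = { injected: false };

// Vulnerable pattern: the value is pasted into the source text itself.
function runInterpolated(slug: string): unknown {
  const source = `const slug = '${slug}'; return slug;`;
  return new Function("marker", source)(marker);
}

// Safer pattern: the source text is constant and the value arrives as data.
function runWithEnv(env: Record<string, string>): unknown {
  const source = "const slug = env.SLUG || ''; return slug;";
  return new Function("env", "marker", source)(env, marker);
}

const hostile = "fix'; marker.injected = true; //";

const envResult = runWithEnv({ SLUG: hostile });
const injectedAfterEnv = marker.injected; // hostile text stayed a plain string

runInterpolated(hostile);
const injectedAfterInterpolation = marker.injected; // the quote closed the literal
```

With the constant source, the hostile value comes back unchanged as a string. With interpolation, the single quote ends the literal and the rest of the value runs as a statement.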

A Finding sqry Made Possible

sqry's find_cycles tool detected a mutual recursion between switchChatTab and pollChat in the Chrome extension's sidepanel:

switchChatTab -> pollChat -> switchChatTab (cycle depth: 2)

pollChat fetches the server's active tab ID. If it differs from the client's, it calls switchChatTab. switchChatTab sets state and immediately calls pollChat. If the server keeps returning a different tab ID during rapid switching, this creates unbounded stack recursion.

grep alone will not reveal this relationship. The bug lives in the interaction between two functions, and that interaction only becomes visible in the call graph.
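One common way to break a cycle like this is a reentrancy guard. The sketch below is a simplified, synchronous model and not the actual gstack fix: the `ChatTabs` class, the `flappingServer` helper, and the depth counters are assumptions added to make the effect measurable.

```typescript
// Hypothetical model of the sidepanel logic. Method names mirror the article;
// the bodies, counters, and guard flag are assumptions for illustration.
class ChatTabs {
  activeTab = 0;
  maxDepth = 0; // deepest nesting of switchChatTab observed
  private depth = 0;
  private switching = false;
  private readonly fetchServerTab: () => number;
  private readonly guarded: boolean;

  constructor(fetchServerTab: () => number, guarded: boolean) {
    this.fetchServerTab = fetchServerTab;
    this.guarded = guarded;
  }

  pollChat(): void {
    const serverTab = this.fetchServerTab();
    if (serverTab !== this.activeTab) this.switchChatTab(serverTab);
  }

  switchChatTab(tab: number): void {
    if (this.guarded && this.switching) {
      this.activeTab = tab; // record the newest tab, but do not re-enter
      return;
    }
    this.switching = true;
    this.depth++;
    this.maxDepth = Math.max(this.maxDepth, this.depth);
    try {
      this.activeTab = tab;
      this.pollChat();
    } finally {
      this.depth--;
      this.switching = false;
    }
  }
}

// Simulated server that reports a new tab ID on each of the first `changes` calls.
function flappingServer(changes: number): () => number {
  let calls = 0;
  return () => Math.min(++calls, changes);
}

const unguardedTabs = new ChatTabs(flappingServer(50), false);
unguardedTabs.pollChat();

const guardedTabs = new ChatTabs(flappingServer(50), true);
guardedTabs.pollChat();
```

In this model the unguarded version nests once per tab change (`maxDepth` reaches 50), while the guarded version stays at depth 1 and leaves later changes for the next poll.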

The Full List

We classified findings on a four-level scale: HIGH means an attacker can execute arbitrary code or exfiltrate data with minimal prerequisites. MED-HIGH means significant impact but requiring local access or a specific precondition. MED means the issue requires local access, specific conditions, or produces limited impact. LOW covers hardening gaps and defense-in-depth improvements.

#	Severity	Finding
1	HIGH	Shell injection via bash-to-JS interpolation
2	MED-HIGH	Queue file permissions allow local prompt injection
3	MED	`/health` endpoint exposes user activity without auth
4	MED	ReDoS via `new RegExp(userInput)` in frame targeting
5	MED	`chain` command bypasses watch-mode write guard
6	MED	`cookie-import` allows cross-domain cookie planting
7	MED	CSS values unvalidated at 4 injection points
8	MED	Session directory traversal via crafted `active.json`
9	MED	`responsive` screenshots skip path validation
10	MED	`validateOutputPath` uses `path.resolve`, not `realpathSync`*
11	MED	`state load` navigates to unvalidated URLs
12	MED	DOM serialization round-trip enables XSS on tab switch
13	MED	`switchChatTab`/`pollChat` mutual recursion
14	MED	`cookie-import-browser --domain` accepts unvalidated input
15-20	LOW	Info disclosure, timeout handling, bounds validation, prompt injection surface

*Finding 10 is a common pattern worth highlighting:

// BEFORE: resolves logically, symlinks pass through
const resolved = path.resolve(filePath);  // /tmp/safe -> still "/tmp/safe"

// AFTER: resolves physically, symlinks followed to real target
const resolved = realpathSync(filePath);  // /tmp/safe -> "/etc" (outside the safe directory, so blocked)

A symlink at /tmp/safe pointing to /etc would pass a check based on path.resolve but fail one based on realpathSync, because the real path is outside the safe directory.
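A fuller check has two wrinkles: the output file usually does not exist yet, and the safe directory may itself be a symlink. The sketch below shows one way to handle both. It is not gstack's `validateOutputPath`; `physicalPath` and `isInsideSafeDir` are hypothetical helpers, and the demo runs in a throwaway temp directory.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper, not gstack's actual validateOutputPath.
// Resolves a path physically even when the file itself does not exist yet.
function physicalPath(filePath: string): string {
  const abs = path.resolve(filePath);
  try {
    return fs.realpathSync(abs); // existing target: follow every symlink
  } catch {
    // Typical for a new screenshot: resolve the parent directory instead.
    return path.join(fs.realpathSync(path.dirname(abs)), path.basename(abs));
  }
}

function isInsideSafeDir(filePath: string, safeDirs: string[]): boolean {
  const real = physicalPath(filePath);
  return safeDirs.some((dir) => {
    const realDir = fs.realpathSync(dir); // the safe directory may itself be a symlink
    return real === realDir || real.startsWith(realDir + path.sep);
  });
}

// Demo in a throwaway directory: safe/link points outside the safe directory.
const base = fs.mkdtempSync(path.join(os.tmpdir(), "pathcheck-"));
const safe = path.join(base, "safe");
const outside = path.join(base, "outside");
fs.mkdirSync(safe);
fs.mkdirSync(outside);
fs.symlinkSync(outside, path.join(safe, "link"));

const escaping = path.join(safe, "link", "shot.png");
const naiveAccepts = path.resolve(escaping).startsWith(safe + path.sep); // logical check
const physicalAccepts = isInsideSafeDir(escaping, [safe]); // physical check
const legitAccepts = isInsideSafeDir(path.join(safe, "shot.png"), [safe]);

fs.rmSync(base, { recursive: true, force: true });
```

Here the logical check accepts the escaping path, while the physical check rejects it and still accepts a plain file in the safe directory. One limitation of this sketch: a dangling symlink at the final path component is treated like a file that does not exist yet, so a production check would need an extra `lstat` step.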

The Codex Review Gate

In a previous article, I described how we use Codex as a mandatory review gate. Unconditional approval or the work does not ship. Codex earned this role through specificity. Where a generic reviewer might say "consider improving error handling," Codex pinpoints "the catch block on line 47 swallows errors silently." It also has a low false-positive rate, which keeps the gate credible over time.

For this security plan, Codex went through 9 rounds before approving. That says more about our work than the tool. Three examples of what it caught:

Round 2: Our queue validator used string for tabId when the actual writer emits number. A type mismatch that would have caused the validator to reject every real queue entry.
Round 5: null values (which the real writer produces for optional fields) would be rejected by our schema. The validator was correct in theory but wrong against the actual data format.
Round 8: Our test extracted a 1500-character slice from the source file to validate against. That slice bled into adjacent functions, meaning the test could pass even without the fix being applied. The final solution: a brace-walking function body extractor that isolates exactly the target function.
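The brace-walking idea from round 8 can be sketched in a few lines. This is an illustrative version, not the extractor in the PR: `extractFunctionBody` is a hypothetical name, and it deliberately ignores braces inside strings, comments, and parameter lists, which a production version would have to handle.

```typescript
// Hypothetical sketch of a brace-walking extractor. Limitation: braces inside
// strings, comments, template literals, or parameter lists are not handled.
function extractFunctionBody(source: string, name: string): string | null {
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const start = source.search(new RegExp(`function\\s+${escaped}\\s*\\(`));
  if (start === -1) return null;

  const open = source.indexOf("{", start);
  if (open === -1) return null;

  let depth = 0;
  for (let i = open; i < source.length; i++) {
    const ch = source[i];
    if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      // Depth returns to zero at the brace that closes the function.
      if (depth === 0) return source.slice(start, i + 1);
    }
  }
  return null; // unbalanced braces
}

const sample = `
function target(a) {
  if (a) { return { inner: 1 }; }
  return null;
}
function neighbor() { return "second"; }
`;

const extracted = extractFunctionBody(sample, "target");
```

Unlike a fixed-length slice, the extracted text ends at the function's own closing brace, so an assertion against it cannot be satisfied by code in `neighbor`.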

Each round made the plan more precise. The full 9-round breakdown is in the PR #806 discussion.
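The round 2 and round 5 catches both come down to validating against what the writer actually emits. A minimal sketch of such a validator follows, assuming a simplified entry shape: only `tabId` and `stateFile` come from the text above, while `message`, the `QueueEntry` interface, and `parseQueueLine` are hypothetical.

```typescript
// Hypothetical queue entry shape. Only tabId and stateFile are named in the
// article; the message field and this interface are assumptions.
interface QueueEntry {
  tabId: number;
  message: string;
  stateFile: string | null;
}

// Parses one JSONL line and returns null for anything that does not match.
function parseQueueLine(line: string): QueueEntry | null {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return null; // not valid JSON
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) return null;

  const fields = raw as Record<string, unknown>;
  const tabId = fields.tabId;
  const message = fields.message;
  const rawState = fields.stateFile;

  // The writer emits a number here, so a string such as "3" is rejected.
  if (typeof tabId !== "number" || !Number.isInteger(tabId)) return null;
  if (typeof message !== "string") return null;

  // Optional field: the writer may emit null or omit it entirely.
  let stateFile: string | null = null;
  if (rawState !== undefined && rawState !== null) {
    if (typeof rawState !== "string") return null;
    // Reject any ".." path segment, with either separator style.
    if (rawState.split(/[\\/]/).includes("..")) return null;
    stateFile = rawState;
  }
  return { tabId, message, stateFile };
}

const accepted = parseQueueLine('{"tabId":3,"message":"hi","stateFile":null}');
const wrongType = parseQueueLine('{"tabId":"3","message":"hi"}');
const traversal = parseQueueLine('{"tabId":3,"message":"hi","stateFile":"../../etc/passwd"}');
```

The first line is accepted with `stateFile` normalized to `null`. The other two are rejected, one for the string `tabId` and one for the traversal segment.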

Implementation: Subagent-Driven Development

With an approved plan, we dispatched one implementation subagent per task, 18 tasks total. Each subagent:

Read the specific source files
Created failing tests
Implemented the fix
Verified tests pass
Committed

A mid-implementation code review by a separate review agent caught 4 additional gaps we had missed:

applyStyle in the extension was missing the same CSS validation added to 3 other injection points
snapshot.ts still used the old path.resolve pattern
stateFile in queue entries had no path traversal check
cookie-import's read path validation used the old pattern

All fixed before continuing. That is why you review.

Test Results

Security regression tests: 119 pass, 0 fail [47ms]
E2E evals (Docker + Chromium): 33 pass, 0 regressions
Previously-failing browse tests: all 3 now pass

The E2E evals ran inside a Docker container (Ubuntu 24.04, Chromium 145, Playwright 1.58.2, --cap-add SYS_ADMIN for the Chromium sandbox). One test outside the security suite (qa-bootstrap) failed due to test infrastructure — it is not included in the 33 count above.

How It Landed

On April 6, the maintainer cherry-picked both our first round (PR #664) and second round (PR #806) onto the garrytan/security-wave-5 branch with co-author credit. They are part of PR #847, which bundles fixes from 8 community PRs across 4 contributors. That PR is open and under review at time of writing.

This did not happen immediately. On April 5, the maintainer merged PR #810 ("security wave 1"), which cherry-picked fixes from Gonzih and garagon. Both had independently found several of the issues we filed on March 30 in round 1 (#665-#670, #672-#675). At that point our PRs were still open without comment.

We flagged four gaps in that initial wave:

validateOutputPath was only fixed in one of three copies. The identical vulnerable function in meta-commands.ts and inline validation in snapshot.ts still used path.resolve without realpathSync.
The fix broke on macOS. SAFE_DIRECTORIES contained /tmp, but on macOS /tmp is a symlink to /private/tmp. realpathSync resolves through it, causing legitimate screenshots to be rejected.
No queue entry schema validation. File permissions were added, but queue entry contents were not validated against type checks or path traversal.
/health still leaked user activity. The unauthenticated response returned the user's current URL and sidebar AI message text.

All four gaps are addressed in the security wave 5 PR. The maintainer included garagon's #820 (symlink resolution in meta-commands), our queue validation and /health fixes from #806, and the full set of CSS injection guards, cookie domain validation, reentrancy guards, and SIGKILL escalation across both our rounds.

The PR summary lists 20 security fixes with 750+ lines of new regression tests, attributed jointly to "@mr-k-man, @garagon." Most of those 20 fixes came from our two PRs (#664 and #806). garagon contributed three — shell injection env vars (#819), meta-commands symlink resolution (#820), and upload path validation (#821) — two of which address issues we originally reported. The commit history in #847 shows separate cherry-picks for each source PR.

This timeline is common in open source security work, which often lands asynchronously and in waves. We filed issues and PRs on March 30. Other contributors independently found overlapping issues. The maintainer triaged and cherry-picked fixes in waves over 7 days, starting with the most urgent. Our work was picked up last but included completely, with co-author attribution. Thorough reports with working patches tend to get recognized, even when the initial response is silence.

The Toolkit

Everything described here uses two open-source tools:

sqry: AST-based semantic code search. Builds a graph of symbols and relationships across 35+ languages. Exposes 34 MCP tools for AI agents to navigate code structurally.

llm-cli-gateway: Multi-LLM orchestration via MCP. Routes requests through Claude, Codex, and Gemini with session continuity, async job management, and approval gates.

Both are MIT-licensed. sqry runs entirely locally. llm-cli-gateway runs locally but routes requests to remote LLM APIs (Claude, Codex, Gemini).

What We Learned

Independent convergence validates methodology. When other contributors find the same issues through completely different methods, you have much stronger grounds to trust the results.

Rigorous review improves your own work most of all. 9 rounds of Codex review sounds like a lot. It was. Every round caught something real. The discipline of submitting to review, and actually fixing what is found, is where the quality comes from.

Structural search finds what text search misses. The switchChatTab/pollChat recursion, the validateOutputPath symlink bypass, the CSS injection across 4 separate code paths — these are relationship issues. Understanding code structure is different from searching code text.

Security review is a good way to serve the open source community. Every maintainer has more feature requests than they can handle. A thorough security review with fixes, tests, and documentation is work that helps everyone who uses the project. We are grateful gstack is open source and that we could contribute.

The full security audit report, implementation plan, and all test results are in PR #806. The round 1 report is in PR #664.

sqry: github.com/verivus-oss/sqry
llm-cli-gateway: github.com/verivus-oss/llm-cli-gateway