Verivus OSS Releases

How We Used AI Agents to Security-Audit an Open Source Project

Using sqry's code graph, parallel audit agents, and iterative Codex review to contribute security improvements to gstack.


Garry Tan open-sourced gstack in late March 2026. It is a CLI toolkit for Claude Code with a headless browser, Chrome extension, skill system, and telemetry layer. The project attracted 30+ contributors within days.

We wanted to contribute something useful. Security review seemed like the right fit. A headless browser that spawns subprocesses and handles cookies has a large attack surface, and security work tends to fall to the bottom of every fast-moving project's priority list.

If you haven't read our earlier posts: sqry is an AST-based code search tool. It parses code like a compiler, building a graph of functions, classes, imports, and call relationships across 35+ languages. llm-cli-gateway orchestrates multiple LLMs (Claude, Codex, Gemini) through a single MCP interface. The Codex review gate is our practice of requiring unconditional Codex approval before shipping.

The Codebase

gstack has about 47,000 symbols across 212 files in TypeScript, JavaScript, HTML, CSS, Shell, Ruby, JSON, and SQL. The browse subsystem has a 750-line command handler with a complexity score of 59. The Chrome extension injects into every page the user visits. The sidebar agent spawns Claude subprocesses from a JSONL queue file.

Running grep "exec" on this codebase returns 60+ matches. None of them look obviously wrong. Security review requires understanding relationships between functions, not just finding keywords.

Why grep Falls Short

In a previous article, I described why structural code search matters for this kind of work.

Say you want to find every path from user input to a dangerous sink like Bun.spawn(). grep finds the spawn calls. It does not tell you which functions call the spawning code, which HTTP endpoints reach those callers, or whether any validation sits between the endpoint and the spawn.

sqry does. For gstack, it built a graph of 46,837 nodes and 39,083 edges in 280ms. With all 36 language plugins enabled (including high-cost plugins like JSON and ServiceNow XML), the full graph captures 55,365 raw edges across 212 files.

sqry index . --force --include-high-cost

Files indexed:  212
Symbols:        46,837
Edges:          39,083 canonical (55,365 raw)
Plugins:        36 active
Build time:     280ms
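To make the "path from endpoint to sink" question concrete, here is a minimal sketch of the kind of traversal a call graph allows. It is not sqry's implementation or query syntax. Apart from Bun.spawn, the node names (POST /command, handleCommand, parseArgs, runWriteCommand, getStatus) are invented for illustration and are not taken from gstack.

```typescript
// Caller -> callees. A toy graph with hypothetical names, not gstack's real one.
type CallGraph = Map<string, string[]>;

// Depth-first search that returns every simple path from `from` to `sink`.
function findPaths(
  graph: CallGraph,
  from: string,
  sink: string,
  trail: string[] = [],
): string[][] {
  if (trail.includes(from)) return []; // already on this path: skip cycles
  const next = [...trail, from];
  if (from === sink) return [next];
  return (graph.get(from) ?? []).flatMap((callee) =>
    findPaths(graph, callee, sink, next),
  );
}

const graph: CallGraph = new Map<string, string[]>([
  ["POST /command", ["handleCommand"]],
  ["handleCommand", ["parseArgs", "runWriteCommand"]],
  ["runWriteCommand", ["Bun.spawn"]],
  ["GET /health", ["getStatus"]],
  ["parseArgs", []],
  ["getStatus", []],
]);

const spawnPaths = findPaths(graph, "POST /command", "Bun.spawn");
const healthPaths = findPaths(graph, "GET /health", "Bun.spawn");

console.log(spawnPaths.map((p) => p.join(" -> ")));
// [ "POST /command -> handleCommand -> runWriteCommand -> Bun.spawn" ]
console.log(healthPaths.length); // 0
```

Each returned path is a list of functions a reviewer can read in order, checking whether a validation step appears before the sink.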

Round 1: 10 Findings, 3 LLMs

Our first audit in March used three LLMs independently. Claude and Codex each found overlapping but non-identical sets of issues. Gemini verified all findings against source. The total was 10 unique vulnerabilities across the browse subsystem. We submitted PR #664 with fixes and filed 10 issues (#665-#675).

What gave us confidence these were real: three other contributors (stedfn, Gonzih, and mehmoodosman) independently found at least 6 of the same issues through completely separate analysis. They had not seen our reports. Independent convergence on the same findings is strong validation.

Round 2: 20 More Findings

For the second audit, we expanded the approach. We dispatched 4 parallel audit agents instead of manually querying sqry:

  • Agent 1: server.ts, covering HTTP endpoints, auth, and CORS
  • Agent 2: write-commands.ts, the highest-complexity function, covering file ops and cookie handling
  • Agent 3: meta-commands.ts, covering command parsing, state management, and frame targeting
  • Agent 4: extension/, covering the Chrome extension sidepanel, inspector, and background worker

Each agent had full sqry MCP access with instructions to look for issues beyond the 10 we had already reported. They returned 25 raw findings. After cross-referencing against 20+ existing community issues and the maintainer's own security work (he had already landed two security-focused PRs), 16 were new. Four more gaps turned up during implementation review, bringing the round 2 total to 20.

A Subtle but Serious Finding

# bin/gstack-learnings-search, lines 46-52
cat "${FILES[@]}" 2>/dev/null | bun -e "
const type = '${TYPE}';
const query = '${QUERY}'.toLowerCase();
const limit = ${LIMIT};
const slug = '${SLUG}';

Bash variables are interpolated directly into JavaScript string literals via bun -e. A branch name containing a single quote, like fix'; process.exit(1); //, would break out of the JS string and execute arbitrary code. Easy to write, hard to spot in review.

The fix: pass parameters via environment variables instead of string interpolation.

cat "${FILES[@]}" 2>/dev/null | \
  GSTACK_FILTER_TYPE="$TYPE" \
  GSTACK_FILTER_QUERY="$QUERY" \
  GSTACK_FILTER_LIMIT="$LIMIT" \
  bun -e "
const type = process.env.GSTACK_FILTER_TYPE || '';
const query = (process.env.GSTACK_FILTER_QUERY || '').toLowerCase();
const limit = parseInt(process.env.GSTACK_FILTER_LIMIT || '10', 10) || 10;

The values now reach the script as data through process.env and are never parsed as source code, so the injection vector disappears.
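The difference between the two patterns can be reproduced without a shell. The sketch below is a simulation under stated assumptions: runScript is a hypothetical stand-in for bun -e that evaluates a source string with an env object in scope, and the hostile value is illustrative. It is not the gstack script itself.

```typescript
// Hypothetical stand-in for `bun -e "<script>"`: runs a source string with an
// `env` object in scope, much as the real script sees process.env.
function runScript(source: string, env: Record<string, string>): unknown {
  return new Function("env", source)(env);
}

// An illustrative hostile value containing a single quote.
const hostile = "fix'; return 'INJECTED'; //";

// Vulnerable pattern: the value is spliced into the source text itself.
const interpolated = `const query = '${hostile}'; return query.toLowerCase();`;
const vulnerableResult = runScript(interpolated, {});

// Fixed pattern: the source text is constant and the value arrives as data.
const constantSource =
  "const query = env.QUERY || ''; return query.toLowerCase();";
const safeResult = runScript(constantSource, { QUERY: hostile });

console.log(vulnerableResult); // INJECTED
console.log(safeResult); // fix'; return 'injected'; //
```

In the first call the quote ends the string literal early and the rest of the value runs as code. In the second call the same value is only ever a string.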

A Finding sqry Made Possible

sqry's find_cycles tool detected a mutual recursion between switchChatTab and pollChat in the Chrome extension's sidepanel:

switchChatTab -> pollChat -> switchChatTab (cycle depth: 2)

pollChat fetches the server's active tab ID. If it differs from the client's, it calls switchChatTab. switchChatTab sets state and immediately calls pollChat. If the server keeps returning a different tab ID during rapid switching, this creates unbounded stack recursion.

grep will not surface this. The bug lives in the relationship between two functions, and that relationship only becomes visible in the call graph.
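For readers who want to see why a call graph makes this visible, here is a minimal cycle detector over a toy graph. It is a generic depth-first search, not sqry's find_cycles implementation. The switchChatTab and pollChat names come from the finding above, while fetchActiveTab and renderTabs are hypothetical neighbours added for illustration.

```typescript
// Caller -> callees.
type CallGraph = Map<string, string[]>;

// Returns the first cycle found as a list of function names, or null.
function findCycle(graph: CallGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  function visit(node: string): string[] | null {
    state.set(node, "visiting");
    stack.push(node);
    for (const callee of graph.get(node) ?? []) {
      if (state.get(callee) === "visiting") {
        // The callee is already on the current call chain: that is a cycle.
        return [...stack.slice(stack.indexOf(callee)), callee];
      }
      if (!state.has(callee)) {
        const found = visit(callee);
        if (found) return found;
      }
    }
    stack.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of graph.keys()) {
    if (!state.has(node)) {
      const found = visit(node);
      if (found) return found;
    }
  }
  return null;
}

const sidepanel: CallGraph = new Map<string, string[]>([
  ["switchChatTab", ["pollChat", "renderTabs"]],
  ["pollChat", ["fetchActiveTab", "switchChatTab"]],
  ["fetchActiveTab", []],
  ["renderTabs", []],
]);

const cycle = findCycle(sidepanel);
console.log(cycle?.join(" -> ")); // switchChatTab -> pollChat -> switchChatTab
```

A detector like this only reports that the cycle exists. Deciding whether it can recurse without bound still requires reading the two functions.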

The Full List

#     | Severity | Finding
1     | HIGH     | Shell injection via bash-to-JS interpolation
2     | MED-HIGH | Queue file permissions allow local prompt injection
3     | MED      | /health endpoint exposes user activity without auth
4     | MED      | ReDoS via new RegExp(userInput) in frame targeting
5     | MED      | chain command bypasses watch-mode write guard
6     | MED      | cookie-import allows cross-domain cookie planting
7     | MED      | CSS values unvalidated at 4 injection points
8     | MED      | Session directory traversal via crafted active.json
9     | MED      | responsive screenshots skip path validation
10    | MED      | validateOutputPath uses path.resolve, not realpathSync*
11    | MED      | state load navigates to unvalidated URLs
12    | MED      | DOM serialization round-trip enables XSS on tab switch
13    | MED      | switchChatTab/pollChat mutual recursion
14    | MED      | cookie-import-browser --domain accepts unvalidated input
15-20 | LOW      | Info disclosure, timeout handling, bounds validation, prompt injection surface

*Finding 10 is a common pattern worth highlighting:

import { realpathSync } from "node:fs";
import path from "node:path";

// BEFORE: resolves lexically, so a symlink in the path is not followed
const lexical = path.resolve(filePath);   // /tmp/safe -> still "/tmp/safe"

// AFTER: resolves on disk, so symlinks are followed to the real target
const physical = realpathSync(filePath);  // /tmp/safe -> "/etc/shadow" (blocked!)

A symlink at /tmp/safe pointing to /etc/shadow would pass a check built on path.resolve but fail one built on realpathSync, because the real path is outside the safe directory.
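A fuller version of the pattern might look like the sketch below. It is our illustration, not gstack's validateOutputPath: the function names, the temporary directory layout, and the choice to fall back to the parent directory for files that do not exist yet are all assumptions. It resolves the safe directories as well as the file path, which matters on systems where a safe directory is itself a symlink. It assumes a POSIX system where creating symlinks is permitted.

```typescript
import { mkdirSync, mkdtempSync, realpathSync, rmSync, symlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// Resolve a path on disk. If the file does not exist yet, resolve its parent
// directory and re-attach the file name. (A dangling symlink at the final
// component is not handled by this sketch.)
function realTarget(filePath: string): string | null {
  const absolute = path.resolve(filePath);
  try {
    return realpathSync(absolute);
  } catch {
    try {
      return path.join(realpathSync(path.dirname(absolute)), path.basename(absolute));
    } catch {
      return null; // parent directory is missing too
    }
  }
}

// True only if the real location of filePath is inside a real safe directory.
function isInsideSafeDir(filePath: string, safeDirs: string[]): boolean {
  const target = realTarget(filePath);
  if (target === null) return false;
  return safeDirs.some((dir) => {
    let realDir: string;
    try {
      realDir = realpathSync(dir); // resolve the safe directory too
    } catch {
      return false;
    }
    return target === realDir || target.startsWith(realDir + path.sep);
  });
}

// Demo layout: <root>/safe, <root>/secret, and a symlink safe/link -> secret.
const root = mkdtempSync(path.join(tmpdir(), "pathcheck-"));
const safe = path.join(root, "safe");
const secret = path.join(root, "secret");
mkdirSync(safe);
mkdirSync(secret);
symlinkSync(secret, path.join(safe, "link"));

const allowed = isInsideSafeDir(path.join(safe, "shot.png"), [safe]);
const blocked = isInsideSafeDir(path.join(safe, "link", "shot.png"), [safe]);
rmSync(root, { recursive: true, force: true });

console.log(allowed); // true
console.log(blocked); // false
```

A purely lexical check would accept both paths, since each one starts with the safe directory as text.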

The Codex Review Gate

In a previous article, I described how we use Codex as a mandatory review gate. Unconditional approval or the work does not ship. Codex earned this role through specificity. Where a generic reviewer might say "consider improving error handling," Codex pinpoints "the catch block on line 47 swallows errors silently." It also has a low false-positive rate, which keeps the gate credible over time.

For this security plan, Codex went through 9 rounds before approving. Honestly, this says more about our work than the tool. It kept finding legitimate gaps in our plan:

Round | Verdict         | What Codex caught
1     | REQUEST_CHANGES | 9 items: missing tasks, incomplete validation, stale line numbers
2     | REQUEST_CHANGES | 4 items: phantom files that do not exist, wrong field types
3     | REQUEST_CHANGES | 2 items: test too broad, tabId should be number not string
4     | REQUEST_CHANGES | 2 items: undefined variable in test, field assertions incomplete
5     | REQUEST_CHANGES | 2 items: null handling missing, wrong command name
6     | REQUEST_CHANGES | 2 items: test could pass without the fix, summary stale
7     | REQUEST_CHANGES | 1 item: regex matches unrelated code elsewhere in file
8     | REQUEST_CHANGES | 1 item: fixed-width text slice bleeds into adjacent functions
9     | APPROVE         | Brace-walking function body extractor

Round 3 caught that our queue validator typed tabId as a string when the actual writer emits a number. Round 5 caught that null values (which the real writer produces) would be rejected. Round 8 caught that our test's 1500-character slice could accidentally include adjacent code, satisfying assertions without the fix being correct.

Each round made the plan more precise. By round 9, every test was scoped to the exact function body it validates, every type matched the real writer, and every file in scope had a corresponding task. Even when you feel confident, a rigorous reviewer makes the work better.
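To show what scoping a test to a function body means in practice, here is a minimal brace-walking extractor. It is a sketch, not the helper from our test suite: the name extractFunctionBody and the sample source are hypothetical, and it deliberately ignores braces inside strings, comments, or parameter type annotations.

```typescript
// Returns the text from the opening brace of `function <name>(` to its
// matching closing brace, or null if the function is not found.
function extractFunctionBody(source: string, name: string): string | null {
  const start = source.indexOf(`function ${name}(`);
  if (start === -1) return null;
  const open = source.indexOf("{", start);
  if (open === -1) return null;

  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === "{") depth++;
    else if (source[i] === "}") {
      depth--;
      if (depth === 0) return source.slice(open, i + 1);
    }
  }
  return null; // unbalanced braces
}

// Hypothetical sample: the fix lives in `second`, not in `first`.
const sample = `
function first(x) {
  if (x) { return 1; }
  return 0;
}
function second(y) {
  return validate(y);
}
`;

const start = sample.indexOf("function first(");
const fixedSlice = sample.slice(start, start + 1500); // bleeds into `second`
const body = extractFunctionBody(sample, "first");

console.log(fixedSlice.includes("validate")); // true
console.log(body !== null && body.includes("validate")); // false
```

An assertion that looks for validate in the fixed-width slice passes even though first never calls it. The same assertion against the extracted body fails, which is the behavior a regression test needs.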

Implementation: Subagent-Driven Development

With an approved plan, we dispatched one implementation subagent per task, 18 tasks total. Each subagent:

  1. Read the specific source files
  2. Created failing tests
  3. Implemented the fix
  4. Verified tests pass
  5. Committed

A mid-implementation code review by a separate review agent caught 4 additional gaps we had missed:

  • applyStyle in the extension was missing the same CSS validation added to 3 other injection points
  • snapshot.ts still used the old path.resolve pattern
  • stateFile in queue entries had no path traversal check
  • cookie-import's read path validation used the old pattern

All fixed before continuing. That is why you review.

Test Results

Security regression tests: 119 pass, 0 fail [47ms]
E2E evals (Docker + Chromium): 33 pass, 0 regressions
Previously-failing browse tests: all 3 now pass

The E2E evals ran inside a Docker container (Ubuntu 24.04, Chromium 145, Playwright 1.58.2, --cap-add SYS_ADMIN for the Chromium sandbox). One test (qa-bootstrap) failed for test-infrastructure reasons unrelated to these changes.

A Note on Attribution and Completeness

On April 5, the maintainer landed PR #810, titled "security wave 1," which fixed 14 issues from seanomich's audit (#783). Several of those fixes address the same vulnerabilities we reported in our round 1 issues (#665-#675), filed on March 30.

The fixes were cherry-picked from Gonzih (#743, #744, #745, #750, #751) and garagon (#803), who independently found and fixed the same bugs days after our issues were filed. Our PR #664 and 10 issues remain open without comment.

We are glad the bugs are getting fixed. That was the goal. But the cherry-picked fixes have gaps that our patches addressed:

  1. validateOutputPath was only fixed in one of three copies. Gonzih's PR #745 hardens write-commands.ts. The identical vulnerable function in meta-commands.ts and the inline validation in snapshot.ts still use plain path.resolve without realpathSync. Our PR fixed all three.

  2. The fix breaks on macOS. SAFE_DIRECTORIES contains /tmp, but on macOS /tmp is a symlink to /private/tmp. The cherry-picked fix resolves the file path through realpathSync but compares against the unresolved /tmp. Result: realpathSync('/tmp/screenshot.png') returns /private/tmp/screenshot.png, which does not start with /tmp/. Legitimate screenshots get rejected. Our fix resolves both the file path and the safe directories.

  3. No queue entry schema validation. The maintainer added file permissions (0o700 dirs, 0o600 files) and a kill-file mechanism, but queue entry contents are not validated. A local process that can write to the queue file can still inject arbitrary args, stateFile, cwd, and prompt values. Our fix adds isValidQueueEntry with type checks and path traversal guards on all 8 fields.

  4. /health still leaks user activity. The auth token was gated behind a chrome-extension:// Origin header (good). But the unauthenticated /health response still returns the user's current URL and their sidebar AI message text.
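To illustrate the third gap, schema validation of a queue entry could look roughly like the sketch below. This is not our isValidQueueEntry: the entry shape is an assumption covering only the fields named in this post (prompt, args, cwd, stateFile, tabId), which fields may be null is a guess, and the function and helper names are hypothetical.

```typescript
// Assumed entry shape, limited to the fields mentioned in this post.
interface QueueEntry {
  prompt: string;
  args: string[];
  cwd: string | null;
  stateFile: string | null;
  tabId: number | null;
}

// True if any path segment is "..".
function hasTraversal(p: string): boolean {
  return p.split(/[\\/]/).includes("..");
}

function looksLikeQueueEntry(value: unknown): value is QueueEntry {
  if (typeof value !== "object" || value === null) return false;
  const e = value as Record<string, unknown>;
  const pathOk = (v: unknown): boolean =>
    v === null || (typeof v === "string" && !hasTraversal(v));
  return (
    typeof e.prompt === "string" &&
    Array.isArray(e.args) &&
    e.args.every((a) => typeof a === "string") &&
    pathOk(e.cwd) &&
    pathOk(e.stateFile) &&
    (e.tabId === null || (typeof e.tabId === "number" && Number.isInteger(e.tabId)))
  );
}

// One JSONL line per entry: parse, then validate before using anything.
function parseQueueLine(line: string): QueueEntry | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return looksLikeQueueEntry(parsed) ? parsed : null;
  } catch {
    return null; // not valid JSON
  }
}

const good = parseQueueLine(
  '{"prompt":"summarize","args":[],"cwd":null,"stateFile":"state/tab.json","tabId":3}',
);
const traversal = parseQueueLine(
  '{"prompt":"x","args":[],"cwd":null,"stateFile":"../../etc/passwd","tabId":3}',
);
const wrongType = parseQueueLine(
  '{"prompt":"x","args":[],"cwd":null,"stateFile":null,"tabId":"3"}',
);

console.log(good !== null); // true
console.log(traversal); // null
console.log(wrongType); // null
```

File permissions limit who can write to the queue. A check like this limits what a writer can make the consumer do.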

Open source works best when contributors are acknowledged, even if alternative implementations are preferred. Filing detailed security issues with reproduction steps, root cause analysis, and working patches takes real effort. When that work is passed over in favor of less complete alternatives, it sends a discouraging signal to the next person considering whether to audit someone else's project for free.

The Toolkit

Everything described here uses two open-source tools:

sqry: AST-based semantic code search. Builds a graph of symbols and relationships across 35+ languages. Exposes 34 MCP tools for AI agents to navigate code structurally.

llm-cli-gateway: Multi-LLM orchestration via MCP. Routes requests through Claude, Codex, and Gemini with session continuity, async job management, and approval gates.

Both are MIT-licensed and run entirely locally.

What We Learned

Independent convergence validates methodology. When other contributors find the same issues through completely different methods, you can place much more trust in the results.

Rigorous review improves your own work most of all. 9 rounds of Codex review sounds like a lot. It was. Every round caught something real. The discipline of submitting to review, and actually fixing what is found, is where the quality comes from.

Structural search finds what text search misses. The switchChatTab/pollChat recursion, the validateOutputPath symlink bypass, the CSS injection across 4 separate code paths. These are relationship issues. Understanding code structure is different from searching code text.

Security review is a good way to serve the open source community. Every maintainer has more feature requests than they can handle. A thorough security review with fixes, tests, and documentation is work that helps everyone who uses the project. We are grateful gstack is open source and that we could contribute.


The full security audit report, implementation plan, and all test results are in PR #806. The round 1 report is in PR #664.

sqry: github.com/verivus-oss/sqry
llm-cli-gateway: github.com/verivus-oss/llm-cli-gateway
