Using sqry's code graph, parallel audit agents, and iterative Codex review to contribute security improvements to gstack.
Garry Tan open-sourced gstack in late March 2026. It is a CLI toolkit for Claude Code with a headless browser, Chrome extension, skill system, and telemetry layer. The project attracted 30+ contributors within days.
We wanted to contribute something useful. Security review seemed like the right fit. A headless browser that spawns subprocesses and handles cookies has a large attack surface, and security work tends to fall to the bottom of every fast-moving project's priority list.
If you haven't read our earlier posts: sqry is an AST-based code search tool. It parses code like a compiler, building a graph of functions, classes, imports, and call relationships across 35+ languages. llm-cli-gateway orchestrates multiple LLMs (Claude, Codex, Gemini) through a single MCP interface. The Codex review gate is our practice of requiring unconditional Codex approval before shipping.
The Codebase
gstack has about 47,000 symbols across 212 files in TypeScript, JavaScript, HTML, CSS, Shell, Ruby, JSON, and SQL. The browse subsystem has a 750-line command handler with a complexity score of 59. The Chrome extension injects into every page the user visits. The sidebar agent spawns Claude subprocesses from a JSONL queue file.
Running grep "exec" on this codebase returns 60+ matches. None of them look obviously wrong. Security review requires understanding relationships between functions, not just finding keywords.
Why grep Falls Short
In a previous article, I described why structural code search matters for this kind of work.
Say you want to find every path from user input to a dangerous sink like Bun.spawn(). grep finds the spawn calls. It does not tell you which functions call those functions, which HTTP endpoints call those functions, or whether any validation sits between the endpoint and the spawn.
sqry does. For gstack, it built a graph of 46,837 nodes and 39,083 edges in 280ms. With all 36 language plugins enabled (including high-cost plugins like JSON and ServiceNow XML), the full graph captures 55,365 raw edges across 212 files.
sqry index . --force --include-high-cost
Files indexed: 212
Symbols: 46,837
Edges: 39,083 canonical (55,365 raw)
Plugins: 36 active
Build time: 280ms
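To make "a path from an endpoint to a dangerous sink" concrete, here is a minimal sketch of the kind of reachability query a call graph allows. It is not sqry's implementation or API. The `findPath` helper and the edges in `callGraph` are hypothetical, invented for illustration rather than taken from gstack's real graph.

```typescript
// Caller -> callees. These edges are hypothetical, not gstack's real call graph.
const callGraph = new Map<string, string[]>([
  ["POST /command", ["handleCommand"]],
  ["handleCommand", ["validateArgs", "runScreenshot"]],
  ["runScreenshot", ["Bun.spawn"]],
  ["GET /health", ["readStatus"]],
]);

// Breadth-first search from `source` to `sink`.
// Returns the shortest call chain, or null if the sink is unreachable.
function findPath(
  graph: Map<string, string[]>,
  source: string,
  sink: string,
): string[] | null {
  // prev maps each discovered node to the node it was reached from.
  const prev = new Map<string, string | null>([[source, null]]);
  const queue: string[] = [source];

  while (queue.length > 0) {
    const node = queue.shift()!;
    if (node === sink) {
      // Walk the prev links back to the source to rebuild the chain.
      const chain: string[] = [];
      for (let cur: string | null = node; cur !== null; cur = prev.get(cur) ?? null) {
        chain.unshift(cur);
      }
      return chain;
    }
    for (const callee of graph.get(node) ?? []) {
      if (!prev.has(callee)) {
        prev.set(callee, node);
        queue.push(callee);
      }
    }
  }
  return null;
}

const spawnPath = findPath(callGraph, "POST /command", "Bun.spawn");
const healthPath = findPath(callGraph, "GET /health", "Bun.spawn");
```

In this toy graph, `spawnPath` is the chain from the endpoint through `handleCommand` and `runScreenshot` to `Bun.spawn`, and `healthPath` is `null`. Text search can list the `Bun.spawn` call sites but cannot produce the chain, and it is the chain that tells you whether `validateArgs` actually sits on the route to the sink.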
Round 1: 10 Findings, 3 LLMs
Our first audit in March used three LLMs independently. Claude and Codex each found overlapping but non-identical sets of issues. Gemini verified all findings against source. The total was 10 unique vulnerabilities across the browse subsystem. We submitted PR #664 with fixes and filed 10 issues (#665-#675).
What gave us confidence these were real: three other contributors (stedfn, Gonzih, and mehmoodosman) independently found at least 6 of the same issues through completely separate analysis. They had not seen our reports. Independent convergence on the same findings is strong validation.
Round 2: 20 More Findings
For the second audit, we expanded the approach. We dispatched 4 parallel audit agents instead of manually querying sqry:
- Agent 1: server.ts, covering HTTP endpoints, auth, and CORS
- Agent 2: write-commands.ts, the highest-complexity function, covering file ops and cookie handling
- Agent 3: meta-commands.ts, covering command parsing, state management, and frame targeting
- Agent 4: extension/, covering the Chrome extension sidepanel, inspector, and background worker
Each agent had full sqry MCP access with instructions to look for issues beyond the 10 we had already reported. They returned 25 raw findings. After cross-referencing against 20+ existing community issues and the maintainer's own security work (he had already landed two security-focused PRs), 16 were new. Four more gaps turned up during implementation review.
A Subtle but Serious Finding
# bin/gstack-learnings-search, lines 46-52
cat "${FILES[@]}" 2>/dev/null | bun -e "
const type = '${TYPE}';
const query = '${QUERY}'.toLowerCase();
const limit = ${LIMIT};
const slug = '${SLUG}';
Bash variables are interpolated directly into JavaScript string literals via bun -e. A branch name containing a single quote, like fix'; process.exit(1); //, would break out of the JS string and execute arbitrary code. Easy to write, hard to spot in review.
The fix: pass parameters via environment variables instead of string interpolation.
cat "${FILES[@]}" 2>/dev/null | \
GSTACK_FILTER_TYPE="$TYPE" \
GSTACK_FILTER_QUERY="$QUERY" \
GSTACK_FILTER_LIMIT="$LIMIT" \
bun -e "
const type = process.env.GSTACK_FILTER_TYPE || '';
const query = (process.env.GSTACK_FILTER_QUERY || '').toLowerCase();
const limit = parseInt(process.env.GSTACK_FILTER_LIMIT || '10', 10) || 10;
Environment variables are never interpreted as code. The injection vector disappears.
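The difference between the two versions can be seen by building the script text for the malicious branch name from above. This is an illustrative sketch only: `buildScriptUnsafe` and `buildInvocationSafe` are hypothetical helpers that mimic what the shell does, not code from gstack.

```typescript
// Mimics the original script: the value is spliced into the JS source text.
function buildScriptUnsafe(query: string): string {
  return `const query = '${query}'.toLowerCase();`;
}

// Mimics the fix: the script text is a constant, and the value travels
// separately as an environment variable (data, never source code).
const SAFE_SCRIPT =
  "const query = (process.env.GSTACK_FILTER_QUERY || '').toLowerCase();";

function buildInvocationSafe(query: string): {
  script: string;
  env: Record<string, string>;
} {
  return { script: SAFE_SCRIPT, env: { GSTACK_FILTER_QUERY: query } };
}

const payload = "fix'; process.exit(1); //";

// The quote in the payload closes the string literal early, so the
// injected statement becomes part of the program:
//   const query = 'fix'; process.exit(1); //'.toLowerCase();
const unsafeScript = buildScriptUnsafe(payload);

// The script is identical for every input; the payload is only a value.
const safeInvocation = buildInvocationSafe(payload);
```

The unsafe script changes shape with its input, which is the defining property of an injection bug. The safe script is the same string regardless of what the branch name contains.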
A Finding sqry Made Possible
sqry's find_cycles tool detected a mutual recursion between switchChatTab and pollChat in the Chrome extension's sidepanel:
switchChatTab -> pollChat -> switchChatTab (cycle depth: 2)
pollChat fetches the server's active tab ID. If it differs from the client's, it calls switchChatTab. switchChatTab sets state and immediately calls pollChat. If the server keeps returning a different tab ID during rapid switching, this creates unbounded stack recursion.
grep will not surface this. The bug lives in the relationship between two functions, and that relationship only becomes visible in the call graph.
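For readers unfamiliar with cycle detection, the following is a minimal depth-first sketch of the general technique. It is not sqry's find_cycles implementation. Only the `pollChat` and `switchChatTab` edge pair comes from the finding above; the other node names (`init`, `fetchActiveTab`, `setState`) are assumptions added to give the graph some shape.

```typescript
// Depth-first search that returns the first call cycle found, or null.
function findCycle(graph: Map<string, string[]>): string[] | null {
  // "visiting" = on the current DFS stack; "done" = fully explored.
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  function visit(node: string): string[] | null {
    state.set(node, "visiting");
    stack.push(node);
    for (const callee of graph.get(node) ?? []) {
      if (state.get(callee) === "visiting") {
        // Back edge: the callee is already on the stack, so we found a cycle.
        return [...stack.slice(stack.indexOf(callee)), callee];
      }
      if (!state.has(callee)) {
        const found = visit(callee);
        if (found !== null) return found;
      }
    }
    stack.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of graph.keys()) {
    if (!state.has(node)) {
      const found = visit(node);
      if (found !== null) return found;
    }
  }
  return null;
}

// Hypothetical slice of the sidepanel call graph.
const sidepanelGraph = new Map<string, string[]>([
  ["init", ["pollChat"]],
  ["pollChat", ["fetchActiveTab", "switchChatTab"]],
  ["switchChatTab", ["setState", "pollChat"]],
]);

const cycle = findCycle(sidepanelGraph);
```

Here `cycle` is `pollChat`, `switchChatTab`, `pollChat`: the same loop as above, reported starting from whichever function the search reaches first. Neither function looks wrong in isolation, which is why the edge list is needed to see the problem.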
The Full List
| # | Severity | Finding |
|---|---|---|
| 1 | HIGH | Shell injection via bash-to-JS interpolation |
| 2 | MED-HIGH | Queue file permissions allow local prompt injection |
| 3 | MED | /health endpoint exposes user activity without auth |
| 4 | MED | ReDoS via new RegExp(userInput) in frame targeting |
| 5 | MED | chain command bypasses watch-mode write guard |
| 6 | MED | cookie-import allows cross-domain cookie planting |
| 7 | MED | CSS values unvalidated at 4 injection points |
| 8 | MED | Session directory traversal via crafted active.json |
| 9 | MED | responsive screenshots skip path validation |
| 10 | MED | validateOutputPath uses path.resolve, not realpathSync* |
| 11 | MED | state load navigates to unvalidated URLs |
| 12 | MED | DOM serialization round-trip enables XSS on tab switch |
| 13 | MED | switchChatTab/pollChat mutual recursion |
| 14 | MED | cookie-import-browser --domain accepts unvalidated input |
| 15-20 | LOW | Info disclosure, timeout handling, bounds validation, prompt injection surface |
*Finding 10 is a common pattern worth highlighting:
// BEFORE: resolves logically, symlinks pass through
const resolved = path.resolve(filePath); // /tmp/safe -> still "/tmp/safe"
// AFTER: resolves physically, symlinks followed to real target
const resolved = realpathSync(filePath); // /tmp/safe -> "/etc/shadow" (blocked!)
A symlink at /tmp/safe pointing to /etc/shadow passes a path.resolve check, because the path string still looks like it is under /tmp. It fails a realpathSync check, because the real target is outside the safe directory.
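A fuller version of this check might look like the sketch below. It is written under assumptions rather than copied from the PR: `isInsideSafeDir`, `resolveReal`, `entryExists`, and `demo` are hypothetical names, and the handling of not-yet-existing output files (resolve the parent directory, then re-append the file name) is one possible design, not necessarily the one gstack uses.

```typescript
import { lstatSync, mkdtempSync, realpathSync, symlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// True if something (file, directory, or even a dangling symlink) is at `p`.
function entryExists(p: string): boolean {
  try {
    lstatSync(p);
    return true;
  } catch {
    return false;
  }
}

// Resolve `p` physically. An output file may not exist yet, so in that case
// resolve its parent directory and re-append the file name.
function resolveReal(p: string): string {
  try {
    return realpathSync(p);
  } catch (err) {
    // Something is there but cannot be resolved (e.g. dangling symlink): refuse.
    if (entryExists(p)) throw err;
    return path.join(realpathSync(path.dirname(p)), path.basename(p));
  }
}

// True only if `filePath` physically lives inside one of `safeDirs`.
function isInsideSafeDir(filePath: string, safeDirs: string[]): boolean {
  try {
    const real = resolveReal(path.resolve(filePath));
    return safeDirs.some((dir) => {
      // Resolve the allow-list entry too, so both sides are physical paths.
      const realDir = realpathSync(dir);
      return real === realDir || real.startsWith(realDir + path.sep);
    });
  } catch {
    return false; // Unresolvable paths are rejected rather than trusted.
  }
}

// Builds a throwaway layout: a safe directory containing a symlink that
// points to a second directory outside it.
function demo(): { newFile: boolean; viaSymlink: boolean } {
  const safe = mkdtempSync(path.join(tmpdir(), "safe-"));
  const outside = mkdtempSync(path.join(tmpdir(), "outside-"));
  symlinkSync(outside, path.join(safe, "link"));
  return {
    newFile: isInsideSafeDir(path.join(safe, "shot.png"), [safe]),
    viaSymlink: isInsideSafeDir(path.join(safe, "link", "shot.png"), [safe]),
  };
}
```

In `demo`, a new file directly inside the safe directory is accepted, while the same file name reached through the symlink is rejected. Comparing with `realDir + path.sep` rather than a bare prefix also keeps a sibling such as /tmp-evil from matching /tmp.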
The Codex Review Gate
In a previous article, I described how we use Codex as a mandatory review gate. Unconditional approval or the work does not ship. Codex earned this role through specificity. Where a generic reviewer might say "consider improving error handling," Codex pinpoints "the catch block on line 47 swallows errors silently." It also has a low false-positive rate, which keeps the gate credible over time.
For this security plan, Codex went through 9 rounds before approving. Honestly, this says more about our work than the tool. It kept finding legitimate gaps in our plan:
| Round | Verdict | What Codex caught |
|---|---|---|
| 1 | REQUEST_CHANGES | 9 items: missing tasks, incomplete validation, stale line numbers |
| 2 | REQUEST_CHANGES | 4 items: phantom files that do not exist, wrong field types |
| 3 | REQUEST_CHANGES | 2 items: test too broad, tabId should be number not string |
| 4 | REQUEST_CHANGES | 2 items: undefined variable in test, field assertions incomplete |
| 5 | REQUEST_CHANGES | 2 items: null handling missing, wrong command name |
| 6 | REQUEST_CHANGES | 2 items: test could pass without the fix, summary stale |
| 7 | REQUEST_CHANGES | 1 item: regex matches unrelated code elsewhere in file |
| 8 | REQUEST_CHANGES | 1 item: fixed-width text slice bleeds into adjacent functions |
| 9 | APPROVE | Brace-walking function body extractor |
Round 2 caught that our queue validator used string for tabId when the actual writer emits number. Round 5 caught that null values (which the real writer produces) would be rejected. Round 8 caught that our test's 1500-character slice could accidentally include adjacent code, satisfying assertions without the fix being correct.
Each round made the plan more precise. By round 9, every test was scoped to the exact function body it validates, every type matched the real writer, and every file in scope had a corresponding task. Even when you feel confident, a rigorous reviewer makes the work better.
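The round 8 and round 9 items are easier to see with an example. The sketch below shows the general idea of a brace-walking extractor: find the function declaration, then count braces until the body closes. It is an assumption-laden simplification, not the extractor from our test suite. `extractFunctionBody` and the `sample` source are hypothetical, and braces inside strings, comments, template literals, or parameter defaults are not handled.

```typescript
// Returns the body of `function <name>(...) { ... }` including its braces,
// or null if the function is not found or its braces never balance.
function extractFunctionBody(source: string, name: string): string | null {
  const start = source.indexOf(`function ${name}(`);
  if (start === -1) return null;

  const open = source.indexOf("{", start);
  if (open === -1) return null;

  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === "{") {
      depth++;
    } else if (source[i] === "}") {
      depth--;
      if (depth === 0) return source.slice(open, i + 1);
    }
  }
  return null;
}

// Hypothetical source file: the guard lives in an unrelated function.
const sample = `
function pollChat() {
  if (changed) { switchChatTab(id); }
}
function unrelated() {
  guardAgainstReentry();
}
`;

const pollChatBody = extractFunctionBody(sample, "pollChat");

// A fixed-width slice starting at pollChat also swallows `unrelated`.
const fixedSlice = sample.slice(sample.indexOf("function pollChat("), 1500);
```

An assertion such as "the code contains guardAgainstReentry" passes against `fixedSlice` even though `pollChat` itself has no guard, which is the failure mode Codex flagged. Against `pollChatBody` the same assertion fails, as it should.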
Implementation: Subagent-Driven Development
With an approved plan, we dispatched one implementation subagent per task, 18 tasks total. Each subagent:
- Read the specific source files
- Created failing tests
- Implemented the fix
- Verified tests pass
- Committed
A mid-implementation code review by a separate review agent caught 4 additional gaps we had missed:
- applyStyle in the extension was missing the same CSS validation added to 3 other injection points
- snapshot.ts still used the old path.resolve pattern
- stateFile in queue entries had no path traversal check
- cookie-import's read path validation used the old pattern
All fixed before continuing. That is why you review.
Test Results
Security regression tests: 119 pass, 0 fail [47ms]
E2E evals (Docker + Chromium): 33 pass, 0 regressions
Previously-failing browse tests: all 3 now pass
The E2E evals ran inside a Docker container (Ubuntu 24.04, Chromium 145, Playwright 1.58.2, --cap-add SYS_ADMIN for the Chromium sandbox). One unrelated test (qa-bootstrap) failed due to test infrastructure.
A Note on Attribution and Completeness
On April 5, the maintainer landed PR #810, titled "security wave 1," which fixed 14 issues from seanomich's audit (#783). Several of those fixes address the same vulnerabilities we reported in our round 1 issues (#665-#675), filed on March 30.
The fixes were cherry-picked from Gonzih (#743, #744, #745, #750, #751) and garagon (#803), who independently found and fixed the same bugs days after our issues were filed. Our PR #664 and 10 issues remain open without comment.
We are glad the bugs are getting fixed. That was the goal. But the cherry-picked fixes have gaps that our patches addressed:
- validateOutputPath was only fixed in one of three copies. Gonzih's PR #745 hardens write-commands.ts. The identical vulnerable function in meta-commands.ts and the inline validation in snapshot.ts still use plain path.resolve without realpathSync. Our PR fixed all three.
- The fix breaks on macOS. SAFE_DIRECTORIES contains /tmp, but on macOS /tmp is a symlink to /private/tmp. The cherry-picked fix resolves the file path through realpathSync but compares against the unresolved /tmp. Result: realpathSync('/tmp/screenshot.png') returns /private/tmp/screenshot.png, which does not start with /tmp/. Legitimate screenshots get rejected. Our fix resolves both the file path and the safe directories.
- No queue entry schema validation. The maintainer added file permissions (0o700 dirs, 0o600 files) and a kill-file mechanism, but queue entry contents are not validated. A local process that can write to the queue file can still inject arbitrary args, stateFile, cwd, and prompt values. Our fix adds isValidQueueEntry with type checks and path traversal guards on all 8 fields.
- /health still leaks user activity. The auth token was gated behind a chrome-extension:// Origin header (good). But the unauthenticated /health response still returns the user's current URL and their sidebar AI message text.
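As a rough illustration of what schema validation for a queue entry involves, here is a minimal sketch. It is not the code from our PR. The `QueueEntry` shape is an assumption covering only five fields mentioned in this post (the real entry has 8), and the exact types of `args`, `cwd`, and `stateFile` are guesses. `hasTraversal` and `isOptionalSafePath` are hypothetical helpers.

```typescript
// Assumed shape; only a subset of the real fields, with guessed types.
interface QueueEntry {
  prompt: string;
  args: string[];
  cwd: string | null;
  stateFile: string | null;
  tabId: number | null; // the writer emits a number, or null
}

// True if any path segment is "..", using either slash style.
function hasTraversal(p: string): boolean {
  return p.split(/[\\/]+/).includes("..");
}

// Accepts null (which the writer produces) or a non-empty path without "..".
function isOptionalSafePath(v: unknown): v is string | null {
  return v === null || (typeof v === "string" && v.length > 0 && !hasTraversal(v));
}

// Type guard: every field must be present with the expected type.
function isValidQueueEntry(value: unknown): value is QueueEntry {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const e = value as Record<string, unknown>;
  return (
    typeof e.prompt === "string" &&
    Array.isArray(e.args) &&
    e.args.every((a) => typeof a === "string") &&
    isOptionalSafePath(e.cwd) &&
    isOptionalSafePath(e.stateFile) &&
    (e.tabId === null || (typeof e.tabId === "number" && Number.isInteger(e.tabId)))
  );
}

// Each JSONL line is parsed and checked before anything acts on it.
function parseQueueLine(line: string): QueueEntry | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return isValidQueueEntry(parsed) ? parsed : null;
  } catch {
    return null; // Malformed JSON is dropped, not trusted.
  }
}
```

File permissions limit who can write to the queue. A check like this limits what a written entry can make the agent do, so the two controls cover different failure modes.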
Open source works best when contributors are acknowledged, even if alternative implementations are preferred. Filing detailed security issues with reproduction steps, root cause analysis, and working patches takes real effort. When that work is passed over in favor of less complete alternatives, it sends a discouraging signal to the next person considering whether to audit someone else's project for free.
The Toolkit
Everything described here uses two open-source tools:
sqry: AST-based semantic code search. Builds a graph of symbols and relationships across 35+ languages. Exposes 34 MCP tools for AI agents to navigate code structurally.
llm-cli-gateway: Multi-LLM orchestration via MCP. Routes requests through Claude, Codex, and Gemini with session continuity, async job management, and approval gates.
Both are MIT-licensed and run entirely locally.
What We Learned
Independent convergence validates methodology. When other contributors find the same issues through completely different methods, you can trust the results.
Rigorous review improves your own work most of all. 9 rounds of Codex review sounds like a lot. It was. Every round caught something real. The discipline of submitting to review, and actually fixing what is found, is where the quality comes from.
Structural search finds what text search misses. The switchChatTab/pollChat recursion, the validateOutputPath symlink bypass, the CSS injection across 4 separate code paths. These are relationship issues. Understanding code structure is different from searching code text.
Security review is a good way to serve the open source community. Every maintainer has more feature requests than they can handle. A thorough security review with fixes, tests, and documentation is work that helps everyone who uses the project. We are grateful gstack is open source and that we could contribute.
The full security audit report, implementation plan, and all test results are in PR #806. The round 1 report is in PR #664.
sqry: github.com/verivus-oss/sqry
llm-cli-gateway: github.com/verivus-oss/llm-cli-gateway