Werner Kasselman

Posted on Jun 14

Hardening API Scan Boundaries in skill-scanner, with sqry as the Review Map

#ai #security #python #opensource

On 14 June 2026 I cloned cisco-ai-defense/skill-scanner, set up the locked uv environment, and worked through one small but important question: what does it take to make the REST API safer when the API can scan local directories, accept uploaded ZIP files, run optional analyzers, and queue batch work in the background?

I am not pretending this is a universal API security methodology, or that one branch makes a whole product "secure" in the abstract. This is a narrower story, and I think the narrowness is the useful part: a concrete pass over one public Python repository, with a hardening branch called codex/harden-api-scan-boundaries, ending in commit 2cfa313 and draft PR #119, where the evidence was code, tests, docs, and a graph of the repository rather than a confident read of the obvious files.

The branch changed 24 files, with 1186 insertions and 210 deletions. The main implementation files were skill_scanner/api/router.py, skill_scanner/core/analyzer_factory.py, skill_scanner/core/extractors/content_extractor.py, skill_scanner/core/loader.py, and skill_scanner/core/scanner.py, plus two new shared modules: skill_scanner/core/archive_limits.py and skill_scanner/core/fs_limits.py.

The target: an API wrapped around local scanning work

skill-scanner scans Agent Skill packages. It has CLI paths, Python library paths, eval paths, pre-commit hook paths, and a FastAPI router that exposes endpoints for direct skill scans, uploaded ZIP scans, batch scans, batch-result polling, health checks, and analyzer listing.

That matters because the REST API does not sit in front of a simple database lookup. It sits in front of local filesystem access, archive extraction, analyzer construction, optional remote-service analyzers such as VirusTotal and Cisco AI Defense, LLM-backed analysis, scanner traversal, loader discovery, and report generation. A bug in one visible route handler can be obvious. A missing bound in a shared loader, reached through API, CLI, evals, tests, and scanner methods, is much easier to miss.

The first setup step was boring and necessary:

uv sync --frozen --all-extras --dev

That gave the API dependencies, analyzer extras, pytest, lint tooling, and the project commands needed to move from reading code to running it. The repository also had clear contribution constraints in CONTRIBUTING.md: include tests for changed behaviour, update docs where behaviour or configuration changes, use a conventional commit, keep the uv.lock model intact, and verify with the repository's normal commands.

The hardening target became four broad risk classes:

API callers should not be able to turn server-side path handling into arbitrary filesystem access.
Uploaded archive names and archive contents should not control where the server writes or what it follows.
Request-controlled expensive work needs caps, especially batch scans, traversal, archive expansion, loader discovery, and LLM consensus runs.
Operator-side configuration should remain operator-side configuration, especially remote analyzer endpoints.

There is also the basic API boundary: scan work and scan-result retrieval now require X-API-Key, and the expensive endpoints have process-local rate limiting. Root, health, and analyzer-listing endpoints remain informational.

Why sqry changed the shape of the review

The tool that changed the review was sqry, version 20.0.5. sqry uses "semantic" in the compiler sense, it parses code into ASTs, builds a graph of symbols and relationships, and answers structural questions from that graph. It is not an embedding search tool, and it is not just grep with better ranking.

The local index for this repository had 20,445 symbols across 202 files, with relation support enabled. The graph manifest recorded 26,120 edges across 200 Python files, one Ruby file, and one shell file. That is the practical reason it helped here: the API hardening problem crossed API request models, FastAPI handlers, shared scan implementation, analyzer construction, scanner traversal, loader discovery, archive extraction, documentation, and tests.

The first useful query was not clever:

sqry query 'path:skill_scanner/api/router.py AND kind:function'

It returned 98 function symbols from skill_scanner/api/router.py in about 35 ms on this checkout. More importantly, it produced a checklist that included scan_skill, _scan_skill_impl, scan_uploaded_skill, scan_batch, get_batch_scan_result, run_batch_scan, _validate_path, _count_batch_candidates, and _build_analyzers.

That sounds mundane until you compare it with a manual route read. A manual read tends to start from decorators and then follow the code that looks important. sqry gave me the public route handlers and the helpers in one structural inventory, before I had decided which parts mattered.

The scanner side was the same:

sqry query 'path:skill_scanner/core/scanner.py AND kind:function'

That returned 76 function symbols in about 31 ms, including SkillScanner.scan_skill, SkillScanner.scan_directory, and _find_skill_directories. The useful distinction was between single-skill scanning, directory discovery, and module-level convenience functions. For a hardening pass, that distinction is load-bearing.

Then the review shifted from "where is this string?" to "what code can reach this behaviour?"

sqry graph direct-callers _validate_path --json

sqry reported four direct callers: _resolve_policy, _scan_skill_impl, scan_batch, and run_batch_scan. That made the path gate concrete. It was not enough to harden the direct /scan path. The same gate needed to cover policy paths, direct skill paths, batch roots before queuing, and batch execution inside the background task.

The loader trace was the bigger warning:

sqry graph direct-callers 'SkillLoader.load_skill' --json

That returned 92 direct callers across evals, API code, CLI code, scanner code, and tests. This is where plain text search is weak. You can find load_skill text matches, but you still have to reason manually about which are method calls, convenience wrappers, test helpers, and shared execution paths. sqry made the broad shared surface visible, which is why the fix did not stop at the API router. The loader itself needed a bounded contract.

The same pattern showed up in analyzer construction. build_analyzers had 11 direct callers across API, CLI, hooks, evals, and tests. That meant llm_consensus_runs needed two checks: request-model validation at the API edge, and a second cap inside the analyzer factory so non-API callers get the same invariant.

For LLMAnalyzer._consensus_analyze, sqry reported one direct caller, LLMAnalyzer.analyze_async, which kept the execution-side analysis focussed. The cap belongs before construction reaches the analyzer loop.

Plain rg still had a place for exact strings, route decorators, docs, and final sanity checks. The difference is that sqry gave the graph-backed layer: functions and methods instead of arbitrary text, same-name symbols separated across API, CLI, hooks, evals and tests, and caller/callee traces for security-sensitive helpers.

What was fixed

The API path boundary now fails closed. _validate_path rejects null bytes, resolves the supplied path, and denies access unless SKILL_SCANNER_ALLOWED_ROOTS is configured and the resolved path is inside one of those roots. If no roots are configured, API filesystem access is denied.

That is a deliberate posture. An API that scans local paths should not assume that "current working directory" is a sensible trust boundary, and it should not silently accept arbitrary absolute paths because the caller knows them.

The upload path changed in a similarly blunt way. /scan-upload still checks the client-provided filename to require a .zip upload, but the server no longer uses that filename for the staging path. Uploaded bytes are written to:

zip_path = temp_dir / "upload.zip"

That small line removes an entire class of filename-controlled staging behaviour. Around it, the upload flow now streams in 1 MB chunks, enforces a 50 MB upload limit, reads ZIP EOCD metadata before constructing ZipFile, rejects ZIPs over 500 entries, rejects uncompressed ZIP contents over 200 MB, rejects path traversal entries by resolving each destination under the extraction root, rejects symlink entries, checks again after extraction that no symlink appeared on disk, and only then searches the extracted tree for SKILL.md using a bounded walk.

The EOCD preflight lives in skill_scanner/core/archive_limits.py as read_zip_member_count. It reads the ZIP end-of-central-directory metadata, including the ZIP64 case, before the code has to build a ZipFile object and iterate the archive. The same helper is used by the API upload handler and by ContentExtractor, so archive member-count limits are not two unrelated implementations that can drift.

The traversal helpers live in skill_scanner/core/fs_limits.py:

iter_directory_bounded
walk_directory_bounded

Both are based on os.scandir, and both count entries as they are yielded rather than first materialising a whole tree. They are now used by API batch preflight, scanner directory discovery, loader file discovery, lenient markdown synthesis, and uploaded-tree search. That is the kind of change that looks less exciting than a route patch, but it is exactly where the graph evidence mattered. If the loader has 92 direct callers, the loader cannot depend on the API being the only adult in the room.

Batch scanning now validates the batch root, counts candidates before queueing background work, rejects requests over the configured candidate limit, and passes bounds into SkillScanner.scan_directory:

max_candidates=MAX_BATCH_SKILLS
max_entries_visited=MAX_BATCH_PATHS_VISITED

The default values in the API are 100 candidate skills and 10,000 filesystem entries. The scanner then passes loader bounds into SkillLoader.load_skill, which means the per-skill load step is part of the same bounded execution path rather than an unbounded second phase.

The analyzer boundary changed too. llm_consensus_runs is capped in the API request models with Pydantic, and again in build_analyzers. The API no longer exposes a remote-callable Cisco AI Defense URL override; the analyzer factory can still use operator-controlled arguments and environment configuration, including AI_DEFENSE_API_URL, but the public request model does not let a caller pick the remote endpoint for the server.

Finally, scan endpoints now require X-API-Key backed by SKILL_SCANNER_API_KEY. /scan, /scan-upload, /scan-batch, and /scan-batch/{scan_id} all check it. The result cache for batch scans is also bounded: 1,000 entries, with a 3600 second TTL. The rate limiter is deliberately process-local, configurable through SKILL_SCANNER_API_RATE_LIMIT_REQUESTS and SKILL_SCANNER_API_RATE_LIMIT_WINDOW_SECONDS; that is useful for this server, but it is not a distributed quota system, and the docs should make that kind of caveat visible.

Tests and docs closed the loop

The branch did not stop at implementation. Tests were added or updated across:

tests/test_api_endpoints.py
tests/test_api_deep.py
tests/test_analyzer_factory.py
tests/test_loader.py
tests/test_scanner.py
tests/test_extractors.py
tests/test_cli_tui_api_fixes.py

The focussed verification command was:

uv run pytest \
  tests/test_api_endpoints.py \
  tests/test_api_deep.py \
  tests/test_analyzer_factory.py \
  tests/test_loader.py \
  tests/test_scanner.py \
  tests/test_extractors.py \
  tests/test_cli_tui_api_fixes.py \
  -q

On the current checkout, that collected 216 tests and returned 215 passed, 1 skipped on Python 3.13.13, with only third-party deprecation warnings. The process report also records a broader non-integration, non-LLM, non-e2e run at 1308 passed, 5 skipped, 7 deselected, plus ruff check . and git diff --check during the contribution.

The documentation updates matter because this is not only a code contract. .env.example, API docs, operations docs, endpoint detail pages, and generated reference docs now describe SKILL_SCANNER_API_KEY, SKILL_SCANNER_ALLOWED_ROOTS, rate limits, traversal limits, archive limits, batch limits, and the LLM consensus cap. A security control that exists only in code is easier to bypass operationally than one that is named in the configuration surface people actually read.

What this says about AI-assisted code review

The useful lesson here is not "AI found security bugs". That is too vague, and frankly not the interesting part.

The useful lesson is that AI-assisted review gets much better when the agent is forced to work from repository facts that can be rerun: symbol inventories, caller traces, callee traces, exact changed files, test names, and concrete verification commands. A model can read the most obvious route handler and sound convincing. A graph can show that the helper under discussion has four direct callers, or that a loader method has 92 direct callers, and that changes the review from opinion to coverage.

That is where sqry was valuable. It made the review faster, but the speed was not the main win. The main win was not having to trust a first-pass mental map of the codebase. The map was queryable, and when the map said the loader was shared across API, CLI, eval, scanner, and tests, the fix moved down into the loader. When the map said analyzer construction was shared, the consensus cap moved into the factory as well as the API request model.

This is also why I do not like abstract claims about "secure by design" unless the design names the boundary and the evidence. In this branch, the claims are more modest and more useful: API path access fails closed without configured roots; uploaded filenames no longer control staging paths; archive expansion has member, size, traversal, and symlink checks; batch discovery and scanner traversal have explicit limits; loader discovery has explicit limits; LLM consensus runs are capped at both the request and factory boundary; the focussed suite passes.

Those are claims a maintainer can inspect.

A note from adjacent SkillSpector work

The same pattern showed up while working through issues in NVIDIA SkillSpector: Stage 2 LLM batch failures, retry and concurrency behaviour, unanalyzed findings, ingest-layer bounds, and whitespace-padding detection all ended up being boundary questions. Different repository, different implementation, same shape of problem.

This is the part that feels important to me. AI-assisted development can help us ship faster, but faster shipping also means we can expose larger attack surfaces sooner: more API entry points, more archive and clone paths, more model calls, more background work, more places where a scanner accepts untrusted input. The answer is not to slow everything down by default; it is to make boundary review part of the shipping motion, with concrete limits, tests, and code-graph evidence before the surface gets too wide to reason about.

Takeaways

Start with the execution surface, not the file you happen to be reading. For this branch, the API surface crossed router, scanner, loader, extractor, analyzer factory, docs, and tests.
Use text search for strings, but use AST and graph search for structure. Same-name symbols across API, CLI, hooks, evals, and tests are not one behaviour.
Put limits where shared code is reached, not only where public requests enter. The loader trace is the obvious example here.
Make fail-closed behaviour explicit. SKILL_SCANNER_ALLOWED_ROOTS being absent means no API path access, not "scan whatever path was supplied".
Treat docs as part of the control surface. If an operator must set API keys, allowed roots, traversal caps, or archive limits, the docs need to say so in the places operators read.

Thanks for reading this far, I hope this is useful if you are hardening an API that wraps local filesystem work, archive extraction, or other expensive scanner-style behaviour. The bit I would reuse first is not any single line of code, it is the habit of asking the repository graph where the boundary actually runs before deciding where the fix belongs.