DEV Community

honouralexwill
We Scanned 23,794 OpenClaw Skills. Here Is What the Full Governance Scan Found

The strongest conclusion is simple: There is enough broken, insecure, or incomplete code in the OpenClaw corpus to justify systematic scanning before installation.

OpenClaw has a large public skill ecosystem. That creates obvious upside, but also obvious risk: most users do not inspect every skill they install, and many of those skills are written or assisted by code generation tools.
We ran the full Saturnday governance engine across the entire public OpenClaw corpus, covering security, dependency integrity, testing, code quality, and project hygiene across Python, TypeScript, JavaScript, and shell.

The result is a stark picture of what happens when AI generated or AI assisted code is published at scale with limited review.

What is Saturnday?

Saturnday is a terminal-first, open-source governance runtime for AI coding tools. You describe the project in plain English, and Saturnday plans the build, splits it into tickets, executes each step through your existing AI coder, runs security and quality checks after every commit, auto-repairs failures, and keeps a full evidence trail so unchecked code does not quietly reach the repository. Learn more at saturnday.dev.

Install it with pip install saturnday.

What We Scanned
The scan covered the full public OpenClaw skill corpus.
Corpus summary
Total skills in corpus: 23,794
Skills with scannable code: 10,551
Skills without code, docs only: 13,243
Scan duration: 3 hours 22 minutes
Total findings: 401,052
Distinct finding types: 83
Scan failures: 4

That split matters. More than half of the corpus does not contain scannable code at all. The meaningful denominator for most technical findings is therefore 10,551 code-containing skills, not the full 23,794.
Full data: the complete scan results (401,052 individual findings) are publicly available at github.com/honouralexwill/openclaw-governance-scan

Every finding includes the skill name, check name, file, and detail. Verify any number in this article against the raw data.
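If you want to tally the published findings yourself, a few lines of Python suffice. The sketch below assumes a JSON Lines export with a "check" field per finding; the repository's actual file layout may differ, so treat the field names as assumptions:

```python
import json
from collections import Counter

def summarise_findings(path):
    """Count findings per check name from a JSON Lines export.

    Assumes one JSON object per line with at least a "check" field;
    adjust to the actual schema in the repository.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                counts[json.loads(line)["check"]] += 1
    return counts

# e.g. summarise_findings("findings.jsonl").most_common(5)
# lists the five largest finding types by volume
```

Cross-checking the per-category counts this way is the fastest path to verifying any number quoted below.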
The Build Under Test
This scan used:
Scanner: Saturnday v0.3.10
Engine: full governance engine (50+ checks)
Date: 2026-03-23

The scope included Python, TypeScript, JavaScript, and shell files in each skill directory.

Headline Results
The full governance scan produced **401,052** findings across 83 distinct finding types.
That raw number is real, but it needs interpretation. Aggregate totals can mislead if they mix fundamentally different categories together.
Category breakdown
Security, injection, XSS, CSRF: 6,941
Hardcoded secrets: 734
Auth and session security: 2,505
Code quality: 129,729
Dependencies and imports: 219,152
Testing: 5,602
Project hygiene: 36,389

The central finding is clear: dependency and import failures dominate the corpus.

What Actually Matters
Dependency and import failures are the largest problem
This is the single biggest category by far.
Dependency and import findings
import_check: 106,748
package_not_importable: 43,517
declared_not_installed: 21,704
dependency_declaration: 18,048
possible_typosquat: 1,641
python_version_compat: 6,577
hallucinated_import: 37

Taken together, these findings suggest that a large share of OpenClaw code is not just risky, but structurally broken, incomplete, or incoherent as a project. AI assisted code often imports packages that are missing, declares packages that do not import, or uses modules that do not exist.
The 37 hallucinated imports are especially important. They are a narrow but high signal category because they point to AI generated dependencies that were never published.
This matters for two reasons. First, it means many skills are likely to fail before they deliver any useful work. Second, non-existent dependency names can become a supply chain exposure if someone later publishes a malicious package under the same name.
The broader point is uncomfortable but simple: a large fraction of the problem is not subtle. It is basic project incoherence.
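The import-level failures are mechanically easy to reproduce. Below is a simplified sketch of an import_check style pass, not Saturnday's actual implementation: it resolves each top-level import in a source file against the current environment.

```python
import ast
import importlib.util

def unresolvable_imports(source: str) -> set:
    """Return top-level module names imported by `source` that cannot
    be resolved in the current environment.

    A simplified illustration of an import check; a real scanner also
    consults the skill's declared dependencies and known package lists.
    """
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return {m for m in modules if importlib.util.find_spec(m) is None}

# A skill importing a never-published package is flagged immediately:
# unresolvable_imports("import totally_nonexistent_pkg")
```

Running a check like this before installing a skill catches the "fails before it delivers any useful work" class of breakage in seconds.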

Security findings are broader than most people assume
The full governance engine surfaced a substantial number of conventional security flaws.
Critical security findings
sql_string_building: 3,410
route_no_auth: 1,709
dangerous_xss_sink: 697
idor_no_ownership: 439
csrf_missing: 424
template_injection: 156
api_template_injection: 96
concat_injection: 9
unsafe_email_link: 1

These findings point to common application failure modes inside skills: SQL built by string concatenation, routes with no authentication, unsanitised user input rendered into HTML, and state changing behaviour with no CSRF protection.
That is not merely messy generated code. It suggests that some skills are shipping patterns that would be dangerous in any production software context.
The volume also matters. These are not isolated curiosities. They appear often enough to suggest that weak security defaults are part of the ecosystem's normal output.
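The sql_string_building pattern is easy to see in miniature. The sketch below, using Python's bundled sqlite3 module (the flagged skills use various databases), contrasts the concatenation pattern the scanner flags with the parameterized form that defeats injection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'a@example.com')")

def find_user_unsafe(name):
    # Flagged as sql_string_building: user input concatenated into SQL,
    # so a crafted name can rewrite the query itself.
    return conn.execute(
        "SELECT email FROM users WHERE name = '" + name + "'").fetchall()

def find_user_safe(name):
    # Parameterized query: the driver binds the value, so the input
    # can never change the query's structure.
    return conn.execute(
        "SELECT email FROM users WHERE name = ?", (name,)).fetchall()

# The classic payload "' OR '1'='1" returns every row through the
# unsafe path and nothing through the safe one.
```

The fix is mechanical, which is exactly why its absence at this volume points to weak defaults rather than hard problems.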

Authentication and session security are also weak
The scan found 2,505 findings in authentication and session handling alone.
Auth and session findings
ws_no_origin_check: 404
missing_security_logging: 390
unstructured_security_logging: 290
missing_pkce: 304
missing_oauth_state: 285
ws_no_auth: 249
weak_random_in_auth: 227
memory_store_rate_limit: 119
user_enumeration: 88
missing_token_invalidation: 79
missing_token_expiry: 39
rate_limit_not_applied: 18
session_cookie_no_httponly: 6
missing_samesite: 6
unpinned_algorithm: 1

This matters because many OpenClaw skills now behave less like simple scripts and more like lightweight applications. Once a skill exposes routes, sessions, WebSockets, or OAuth flows, the absence of basic auth hygiene becomes a security issue, not a style issue.
This is one of the clearest signs that the ecosystem has outgrown casual assumptions. A script with poor formatting is one thing. A skill that behaves like an application but ignores basic session and auth controls is something else.
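One recurring finding here, weak_random_in_auth, comes down to a one-line difference. The sketch below shows the pattern in Python: the random module is a predictable PRNG whose output can be reconstructed from observations, while secrets draws from the OS CSPRNG and is the correct tool for anything security-sensitive.

```python
import random
import secrets

def weak_session_token() -> str:
    # Flagged as weak_random_in_auth: Mersenne Twister output is
    # predictable, so session tokens built from it can be forged.
    return "%032x" % random.getrandbits(128)

def strong_session_token() -> str:
    # Cryptographically secure randomness, suitable for session IDs,
    # reset tokens, and API keys.
    return secrets.token_hex(16)
```

Both functions return a 32-character hex string; only one of them is safe to hand out as a session identifier.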

Hardcoded secrets remain real, but should still be described carefully
The scan found 734 hardcoded secret findings.
Secret findings
generic_secret: 359
hardcoded_secret: 101
api_key: 83
env_fallback_secret: 37
payment_secret_literal: 33
github_token: 31
private_key: 23
openai_key: 17
bearer_token: 14
aws_key: 13
anthropic_key: 8
slack_token: 7
npm_token: 6

This is a serious signal. But the careful interpretation is the same as always: these are findings for review, not proof that every match is an active production credential.
Still, the category is too large to dismiss. Even allowing for false positives, hardcoded credentials, token fallbacks, and embedded secrets remain a recurring pattern in the corpus.
That is especially troubling in a public skill ecosystem where users may install code with little or no review.
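To illustrate why every match is a review item rather than proof, here is a minimal regex-based secret scan. The three patterns are illustrative assumptions, not Saturnday's actual rules; real scanners use far larger rule sets plus entropy checks to cut false positives.

```python
import re

# Illustrative patterns only; production scanners combine many more
# rules with entropy scoring and context filtering.
SECRET_PATTERNS = {
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "openai_key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan_for_secrets(text: str):
    """Yield (rule_name, matched_string) pairs.

    Every hit is a review item, not proof of a live credential:
    test fixtures and placeholder keys match the same patterns.
    """
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            yield name, match.group(0)
```

A match against a placeholder key and a match against a live production credential look identical at this layer, which is precisely why triage is required.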

Testing remains weak
The scan found 5,602 testing related findings.
Testing findings
test_no_assert: 4,846
tests_failing: 614
tautological_assertion: 52
tautological_assert: 7
skipped_test: 78
assert_caught: 4
tests_timeout: 1

This is one of the clearest recurring patterns in AI assisted code. Many skills contain tests that look legitimate but either assert nothing, fail when executed, or rely on assertions that can never meaningfully fail.
That does not mean every flagged test is worthless. It does mean that the presence of a tests/ directory should not be treated as evidence of quality.

A passing test suite is meaningful only if it actually verifies behaviour. Large amounts of performative testing create the appearance of discipline without delivering any of its benefits.

Project hygiene is poor at scale
The scan found 36,389 project hygiene findings.
Project hygiene findings
readme_missing_section: 11,195
missing_license: 10,369
missing_readme: 6,763
missing_project_config: 5,455
missing_package_json: 1,254
missing_tsconfig: 441
excessive_blast_radius: 351
syntax_error: 324
readme_language_mismatch: 211
unpinned_dependency: 25
typosquat: 1

These are not cosmetic complaints. Missing README files, missing licences, missing package metadata, and missing project config all affect whether code can be understood, installed, distributed, or maintained.

A public registry full of code without documentation, installation metadata, or licensing clarity is not just untidy. It is operationally hostile to users who are trying to assess what they are installing.
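Most of these hygiene checks reduce to simple filesystem presence tests. A minimal sketch follows, assuming a conventional Python skill layout; the expected file names here are assumptions for illustration, not the registry's required manifest.

```python
from pathlib import Path

# Hypothetical expectations; adjust to the registry's actual conventions.
EXPECTED_FILES = ["README.md", "LICENSE", "pyproject.toml"]

def hygiene_gaps(skill_dir: str) -> list:
    """List expected project files missing from a skill directory.

    A minimal sketch of checks like missing_readme and missing_license;
    Saturnday's engine covers many more, including content-level checks
    such as readme_missing_section.
    """
    root = Path(skill_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

Checks this cheap have no excuse to fail 36,389 times across a public registry.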

The Core Point
The ecosystem does not just contain risky code. It contains a large amount of code that is incomplete, incoherent, untestable, or operationally broken.

That is the real story. Not one dramatic vulnerability class. Not one exotic AI failure mode. A broad pattern of low quality, weak packaging, shallow testing, broken dependencies, and avoidable security flaws.
That pattern matters because users do not experience findings as neat categories. They experience them as failed installs, unpredictable behaviour, hidden exposure, and wasted time.

Scan Failures
Only 4 skills out of 23,794 failed to scan.
The reported causes were non-UTF-8 encoded files in three named skills, plus one unnamed case counted in the total but not logged.
That is a failure rate of 0.017%, which is low enough to support the claim that the scanner handled the corpus robustly.

What This Scan Supports
This scan supports the following claims:
Saturnday scanned the full public OpenClaw corpus of 23,794 skills
10,551 of those skills contained scannable code
the full scan produced 401,052 findings across 83 finding types
the scanner works across Python, TypeScript, JavaScript, and shell
dependency and import problems are the largest category by far
the corpus contains substantial numbers of security findings involving SQL construction, missing auth, XSS, CSRF, OAuth, and session handling
the corpus also contains hardcoded secrets, weak tests, missing project metadata, and syntax failures
the full governance scan completed with only 4 failures

What This Scan Does Not Support

This scan does not prove:
zero false positives
full recall
exploitability for every flagged security pattern
that every hardcoded secret is a live credential
that every flagged test is definitely worthless
that every skill with a finding is unsafe to install
end to end governance effectiveness outside the scanner itself

Those would require separate evidence.
What OpenClaw Users Should Take From This
If you install third party skills, the practical lessons are straightforward:
Do not assume a public skill is installable just because it exists in the registry.
Check dependencies and imports first. That is now the biggest structural failure mode.
Treat routes, WebSockets, OAuth handlers, and database code as high risk areas.
Do not trust the existence of tests without inspecting what those tests actually verify.
Be cautious with skills that ship with no README, no licence, or no package configuration.
Treat hardcoded secret matches as serious review items, but not as automatic proof in every case.

The Bottom Line
The public OpenClaw corpus does not just contain scattered AI specific mistakes. It contains widespread dependency breakage, weak testing, poor project hygiene, and a non trivial volume of conventional security flaws.
That does not mean every skill is dangerous. It does mean the ecosystem is noisy enough, and the failure patterns are common enough, that installing unreviewed skills should be treated as a genuine engineering risk.
