PythonWoods

Posted on Apr 15

Your Docs Pipeline Is a Security Risk — Zenzic v0.6.1rc1 Fixes That

#opensource #python #security #devtools

Most documentation pipelines trust Markdown blindly. Unvalidated links, hidden credential leaks, path traversal risks, engine-specific blind spots — all of this happens before your build system even knows something is wrong.

Zenzic exists to close that gap.

In Part 1, I explained why I built it — the philosophy, the threat model, the architecture of a Pure Python analyzer that lints raw Markdown sources before any build engine touches them.

Today, v0.6.1rc1 "Obsidian Bastion" turns that idea into something much bigger: not just a linter, but a security layer for any Markdown-based documentation stack.

🎯 Where Zenzic fits

If your documentation is part of your CI pipeline, it's part of your attack surface.

Zenzic is designed for CI pipelines that handle untrusted docs, open-source projects with external contributors, teams running multiple doc engines side by side, and security-conscious workflows that need to validate content before the build — not after. Most tools in this space are engine-specific, runtime-dependent, or rely on shelling out to external processes. Zenzic is none of these.

Three core properties define it:

No subprocess execution — ever. No node, no git, no shell calls. The core library is 100% Pure Python. This isn't a convenience feature — it's a security model. A tool that spawns subprocesses is a tool that can be tricked into executing untrusted code.

Engine-agnostic analysis. Zenzic reads raw Markdown and configuration files as plain data. It never imports or executes a documentation framework. Engine-specific knowledge lives in thin, replaceable adapters that translate semantics into a neutral protocol. The core sees only a BaseAdapter — it doesn't know whether you run MkDocs, Docusaurus, or something that doesn't exist yet.
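To make the adapter boundary concrete, here is a minimal sketch using `typing.Protocol`. The method names `source_extensions` and `url_for`, the `VanillaAdapter` class, and the `published_urls` helper are hypothetical, chosen for illustration; the real BaseAdapter interface may look quite different.

```python
from pathlib import PurePosixPath
from typing import Protocol


class BaseAdapter(Protocol):
    """Hypothetical neutral protocol; the real BaseAdapter may differ."""

    def source_extensions(self) -> frozenset[str]: ...
    def url_for(self, source: PurePosixPath) -> str: ...


class VanillaAdapter:
    """Illustrative adapter for plain Markdown with no engine config."""

    def source_extensions(self) -> frozenset[str]:
        return frozenset({".md"})

    def url_for(self, source: PurePosixPath) -> str:
        # Map docs-relative "guide/intro.md" to "/guide/intro/".
        return "/" + str(source.with_suffix("")) + "/"


def published_urls(adapter: BaseAdapter, sources: list[PurePosixPath]) -> list[str]:
    # Core logic depends only on the protocol, never on a concrete engine.
    return [
        adapter.url_for(s)
        for s in sources
        if s.suffix in adapter.source_extensions()
    ]
```

Because the core function only sees the protocol, supporting a new engine in this model means writing one more small class, with no change to the analysis code.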

Deterministic file discovery. Every file scan is explicit. Every path is validated. There are no accidental full-repo traversals, no hidden directories slipping through. Identical source files always produce identical results.

🏛️ From linter to platform

When I wrote Part 1, Zenzic was The Sentinel — a capable linter with MkDocs awareness. It could find broken links, detect credentials, and catch orphaned pages. But it had a blind spot: it could only see one documentation ecosystem.

The 0.6.x series was about removing that limitation entirely. The goal was to build a documentation security layer, not a plugin.

| Version | Codename | Focus |
| --- | --- | --- |
| v0.5.x | The Sentinel | Core scanning + MkDocs awareness |
| v0.6.0 | Obsidian Glass | Headless architecture |
| v0.6.1rc1 | Obsidian Bastion | Platform baseline |

The biggest single commit in this arc deleted 21,870 lines and added 888. That was the Headless Architecture transition: Zenzic stopped being a MkDocs tool and became an Analyser of Documentation Platforms. The documentation site itself was separated into its own Docusaurus-powered repository — and Zenzic now validates it using the same engine-agnostic machinery it offers to everyone else.

⚛️ Parsing Docusaurus without Node

The first concrete challenge was supporting Docusaurus v3. Its config files are TypeScript:

export default {
  presets: [['classic', { docs: { routeBasePath: '/guides' } }]],
  i18n: { defaultLocale: 'en', locales: ['en', 'it'] },
};

The obvious solution, calling node to evaluate the config, would violate the No Subprocesses pillar. So I built a static parser in Pure Python that extracts baseUrl, routeBasePath, locale configuration, and plugin metadata directly from the source text. No evaluation. No runtime. No JavaScript.
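As a rough illustration of what "static parsing" means here, the sketch below pulls literal values out of the config text with regular expressions and bails out when it sees dynamic patterns. It is a simplified stand-in, not the actual adapter code: the function name `extract_config`, the regexes, and the shape of the returned dict are all assumptions.

```python
import re

# Matches `key: 'value'` or `key: "value"` for a fixed set of keys.
_STRING_FIELD = re.compile(
    r"""\b(baseUrl|routeBasePath|defaultLocale)\s*:\s*(['"])(.*?)\2"""
)
# Matches `locales: ['en', 'it']` and captures the bracket contents.
_LOCALES = re.compile(r"\blocales\s*:\s*\[([^\]]*)\]")
# Patterns that signal a config computed at runtime.
_DYNAMIC = re.compile(r"\basync\b|\bimport\s*\(|\brequire\s*\(")


def extract_config(source: str) -> dict:
    """Pull literal values out of config text without evaluating it."""
    if _DYNAMIC.search(source):
        # Report the config as dynamic rather than guess at its values.
        return {"dynamic": True}
    result: dict = {"dynamic": False}
    for key, _quote, value in _STRING_FIELD.findall(source):
        result.setdefault(key, value)  # keep the first occurrence of each key
    match = _LOCALES.search(source)
    if match:
        result["locales"] = re.findall(r"""['"]([^'"]+)['"]""", match.group(1))
    return result


CONFIG = """
export default {
  presets: [['classic', { docs: { routeBasePath: '/guides' } }]],
  i18n: { defaultLocale: 'en', locales: ['en', 'it'] },
};
"""
print(extract_config(CONFIG))
```

The config text is only ever treated as a string, so a hostile config file can at worst produce wrong metadata; it never gets a chance to run.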

The adapter handles .md and .mdx sources, frontmatter slug: resolution (absolute and relative), _-prefixed exclusion (Docusaurus convention), auto-generated sidebar mode, and full i18n locale tree discovery. When it encounters dynamic config patterns (async, import(), require()), it falls back gracefully instead of crashing.
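The slug and underscore rules can be modelled in a few lines. This is a simplified sketch under my own assumptions: the function names `is_excluded` and `resolve_route` are hypothetical, and details such as index pages and trailing slashes are deliberately left out.

```python
import posixpath
from pathlib import PurePosixPath
from typing import Optional


def is_excluded(rel_path: PurePosixPath) -> bool:
    # Docusaurus convention: a path segment starting with "_" is not a page.
    return any(part.startswith("_") for part in rel_path.parts)


def resolve_route(
    rel_path: PurePosixPath, slug: Optional[str], route_base: str
) -> str:
    """Map a docs-relative source path to its route (simplified)."""
    parent = rel_path.parent.as_posix()
    parent = "" if parent == "." else parent
    if slug is None:
        # No frontmatter slug: the route mirrors the file path.
        target = posixpath.join(parent, rel_path.stem)
    elif slug.startswith("/"):
        # Absolute slug: replaces the whole path under the route base.
        target = slug.lstrip("/")
    else:
        # Relative slug: resolved against the source file's directory.
        target = posixpath.normpath(posixpath.join(parent, slug))
    return posixpath.join("/", route_base.strip("/"), target)
```

With `route_base="/guides"`, the file `api/intro.md` resolves to `/guides/api/intro` without a slug, to `/guides/start` with `slug: /start`, and to `/guides/api/overview` with `slug: overview`.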

This matters beyond Docusaurus. It proves that Zenzic's Pure Python core can secure a JavaScript-based documentation stack with zero Node.js dependencies. 65 tests validate the adapter across 12 test classes.

🧱 Layered Exclusion — the real headline feature

File discovery is where most documentation tools quietly fail. A scanner that recursively walks every directory will eventually read inside .git/, node_modules/, or __pycache__/. In the best case, this is slow. In the worst case, it's a security incident.

The Layered Exclusion Manager replaces all ad-hoc directory filtering in Zenzic with a deterministic 4-level hierarchy:

| Level | Source | Behavior |
| --- | --- | --- |
| L1 | System guardrails | Immutable: `.git`, `node_modules`, `__pycache__`, etc. |
| L2 | `.gitignore` + forced inclusions | Additive rules, parsed in Pure Python |
| L3 | Config (`zenzic.toml`) | `excluded_dirs` / `excluded_file_patterns` |
| L4 | CLI flags | `--exclude-dir` / `--include-dir` at runtime |

The levels are not just a convenience API — they encode a security invariant. L1 System Guardrails are immutable: no configuration file and no CLI flag can force Zenzic to scan inside .git/ or node_modules/. This is a deliberate architectural decision. A tool that can be configured to read arbitrary system directories is a tool that can be weaponized.

At L2, .gitignore is interpreted by a built-in VCS Ignore Parser — a Pure Python .gitignore interpreter with pre-compiled regex patterns. No calls to git check-ignore. No subprocess.
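To show the general idea of interpreting .gitignore without calling git, here is a sketch that compiles each rule to a regex once and then applies "last matching rule wins". It covers only a subset of the format and is not the actual VCS Ignore Parser: character classes are not supported, a trailing slash is not restricted to directories, and git's rule that a file cannot be re-included under an excluded parent is not modelled. The names `compile_rule` and `is_ignored` are hypothetical.

```python
from __future__ import annotations

import re


def compile_rule(line: str) -> tuple[re.Pattern[str], bool] | None:
    """Compile one .gitignore line into (regex, is_negation). Subset only."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    negated = line.startswith("!")
    if negated:
        line = line[1:]
    # A slash anywhere except the end anchors the pattern to the root.
    anchored = "/" in line.rstrip("/")
    body = line.strip("/")
    regex = ""
    i = 0
    while i < len(body):
        if body.startswith("**/", i):
            regex += "(?:.*/)?"  # zero or more leading directories
            i += 3
        elif body.startswith("**", i):
            regex += ".*"  # crosses directory boundaries
            i += 2
        elif body[i] == "*":
            regex += "[^/]*"  # stays within one path segment
            i += 1
        elif body[i] == "?":
            regex += "[^/]"
            i += 1
        else:
            regex += re.escape(body[i])
            i += 1
    prefix = "^" if anchored else "^(?:.*/)?"
    # Match the entry itself or anything beneath it.
    return re.compile(prefix + regex + "(?:/.*)?$"), negated


def is_ignored(rel_path: str, rules: list[tuple[re.Pattern[str], bool]]) -> bool:
    ignored = False
    for pattern, negated in rules:  # the last matching rule wins
        if pattern.match(rel_path):
            ignored = not negated
    return ignored


LINES = ["# build output", "build/", "*.log", "!keep.log", "/docs/_drafts"]
RULES = [rule for rule in map(compile_rule, LINES) if rule is not None]
```

Compiling once and matching many times keeps the per-file cost low, and because paths are plain POSIX-style strings relative to the repository root, the result does not depend on the current working directory.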

At L4, a CI operator can pass --include-dir vendor/critical-patch/ without touching config files, or --exclude-dir drafts/ for a specific run. The hierarchy is predictable: a higher-numbered level overrides the ones below it, with the single exception of L1, which nothing can override.
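The precedence rules fit in a small decision function. The sketch below is an illustrative model of the four levels, not Zenzic's ExclusionManager: the class name, field names, and the choice that `--exclude-dir` beats `--include-dir` when both name the same directory are all assumptions.

```python
from dataclasses import dataclass

# L1: hard-coded, never configurable.
SYSTEM_GUARDRAILS = frozenset({".git", "node_modules", "__pycache__"})


@dataclass(frozen=True)
class LayeredExclusion:
    """Illustrative model of the four levels."""

    vcs_ignored: frozenset[str] = frozenset()      # L2: from .gitignore
    config_excluded: frozenset[str] = frozenset()  # L3: zenzic.toml
    cli_excluded: frozenset[str] = frozenset()     # L4: --exclude-dir
    cli_included: frozenset[str] = frozenset()     # L4: --include-dir

    def should_skip(self, dir_name: str) -> bool:
        # L1 is checked first and cannot be overridden by anything below.
        if dir_name in SYSTEM_GUARDRAILS:
            return True
        # L4 comes next, so runtime flags beat config and .gitignore.
        if dir_name in self.cli_excluded:
            return True
        if dir_name in self.cli_included:
            return False
        # L3 and L2 both only add exclusions in this simplified model.
        return dir_name in self.config_excluded or dir_name in self.vcs_ignored
```

In this model, passing `.git` to `--include-dir` has no effect, while passing a .gitignore'd `vendor` directory does re-include it.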

🗡️ The Tabula Rasa refactor

This was the most invasive change in the entire release arc. I removed every single rglob() call from the codebase — all of them — and replaced them with two centralized functions in discovery.py:

from collections.abc import Iterator
from pathlib import Path
def walk_files(root, exclusion_manager) -> Iterator[Path]: ...
def iter_markdown_sources(root, exclusion_manager) -> Iterator[Path]: ...

The exclusion_manager parameter is mandatory. Not Optional, no None default. If you call a scanner or validator entry point without an ExclusionManager, you get a TypeError at call time — not a silent full-tree scan at runtime.
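Here is one way such a pair of functions could be written, assuming an `os.walk`-based traversal that prunes excluded directories in place. The `ExclusionManager` below is a stand-in with a single hypothetical method, `should_skip_dir`, and is not the real class.

```python
import os
from collections.abc import Iterator
from pathlib import Path


class ExclusionManager:
    """Stand-in with one hypothetical method, for illustration only."""

    def __init__(self, excluded_dirs: set[str]) -> None:
        self._excluded = {".git", "node_modules", "__pycache__"} | excluded_dirs

    def should_skip_dir(self, name: str) -> bool:
        return name in self._excluded


def walk_files(root: Path, exclusion_manager: ExclusionManager) -> Iterator[Path]:
    for current, dirs, files in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        # Sorting makes the traversal order deterministic.
        dirs[:] = sorted(
            d for d in dirs if not exclusion_manager.should_skip_dir(d)
        )
        for name in sorted(files):
            yield Path(current) / name


def iter_markdown_sources(
    root: Path, exclusion_manager: ExclusionManager
) -> Iterator[Path]:
    for path in walk_files(root, exclusion_manager):
        if path.suffix in {".md", ".mdx"}:
            yield path
```

Because `exclusion_manager` has no default value, `walk_files(root)` raises `TypeError` as soon as it is called, even though the function is a generator: Python binds the arguments before any of the body runs.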

168 call sites were updated across 13 test files. The result: accidental full-repo scans are now architecturally impossible. Every traversal is explicit, filtered, and auditable. This eliminates a common source of CI slowdowns and — more importantly — removes a class of security blind spots where sensitive directories could be inadvertently read.

🔐 Security hardening

Two targeted fixes closed real attack vectors identified during internal review.

ReDoS prevention (F2-1). Lines exceeding 1 MiB are silently truncated before Shield regex matching. A crafted documentation file with a multi-megabyte line could exploit catastrophic backtracking in credential detection patterns. This is not a theoretical concern — ReDoS is a well-documented attack against input validation layers that use unbounded regex.
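The principle is simple: cap the input before any pattern sees it. A minimal sketch follows, with two assumptions flagged: the credential regex is a hypothetical example rather than one of the Shield's rules, and the cap is measured in characters here, since the text above does not say whether the real limit counts bytes or characters.

```python
import re

MAX_LINE_LEN = 1024 * 1024  # 1 MiB cap; counted in characters in this sketch

# Hypothetical credential pattern (AWS-style access key ID), for illustration.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan_line(line: str) -> list[str]:
    # Bound the input first so regex matching cost has a fixed ceiling.
    if len(line) > MAX_LINE_LEN:
        line = line[:MAX_LINE_LEN]
    return AWS_KEY.findall(line)
```

The trade-off is explicit: anything past the cap on an oversized line goes unscanned, in exchange for a hard upper bound on matching time per line.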

Path traversal guard (F4-1). _validate_docs_root() now rejects docs_dir paths that escape the repository root. A malicious zenzic.toml that sets docs_dir = "../../../etc/" triggers Exit Code 3 (Blood Sentinel) before any file is read. Like the Shield (Exit Code 2), the Blood Sentinel cannot be suppressed or downgraded by any flag. These two non-negotiable exit codes form Zenzic's security perimeter.
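A containment check of this kind can be sketched with `pathlib` alone. The version below is my own simplified take, not the body of `_validate_docs_root()`: the function signature, the `ValueError`, and the `main` wrapper are assumptions, and only the exit code value comes from the text above.

```python
import sys
from pathlib import Path

EXIT_PATH_TRAVERSAL = 3  # the "Blood Sentinel" exit code


def validate_docs_root(repo_root: Path, docs_dir: str) -> Path:
    """Return the resolved docs root, or raise if it escapes repo_root."""
    root = repo_root.resolve()
    # resolve() collapses ".." segments and follows symlinks.
    candidate = (root / docs_dir).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"docs_dir escapes repository root: {docs_dir!r}")
    return candidate


def main(repo_root: Path, docs_dir: str) -> None:
    try:
        validate_docs_root(repo_root, docs_dir)
    except ValueError as exc:
        print(exc, file=sys.stderr)
        sys.exit(EXIT_PATH_TRAVERSAL)
```

Comparing resolved paths rather than raw strings matters: a string check for `..` would miss an absolute `docs_dir` such as `/etc`, or a symlink inside the repository that points outside it.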

🏗️ No subprocesses — now enforced, not aspirational

When I started Zenzic, "No Subprocesses" was a design goal. As of this Release Candidate, it is a verified property of the entire codebase.

The zenzic serve command has been removed entirely — it was the last place where a subprocess could theoretically be spawned. Docusaurus config is parsed as text, not evaluated via Node.js. .gitignore is interpreted in Pure Python, not via git check-ignore. The MkDocs plugin has been relocated to zenzic.integrations.mkdocs and installs separately via pip install "zenzic[mkdocs]", keeping the core free of engine-specific imports.

Zero subprocess.run(). Zero os.system(). Zero shell calls. This makes Zenzic safe to run in any container, any sandbox, any restricted CI environment — without granting it any capabilities beyond reading files.
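If you want to enforce the same property in your own project, one option is a test that scans your source with the standard `ast` module. This is a generic sketch of that technique, not a description of how Zenzic verifies itself; the function name and the lists of forbidden names are assumptions, and the check is not exhaustive (it would miss, for example, `from os import system`).

```python
import ast

FORBIDDEN_MODULES = {"subprocess"}
FORBIDDEN_CALLS = {("os", "system"), ("os", "popen")}


def find_process_spawns(source: str) -> list[str]:
    """Report forbidden imports and attribute uses in Python source text."""
    findings: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            findings += [
                alias.name
                for alias in node.names
                if alias.name.split(".")[0] in FORBIDDEN_MODULES
            ]
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_MODULES:
                findings.append(node.module)
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if (node.value.id, node.attr) in FORBIDDEN_CALLS:
                findings.append(f"{node.value.id}.{node.attr}")
    return findings
```

Run over every `.py` file in a package and asserted to return an empty list, a check like this turns "we don't spawn processes" from a convention into something CI can fail on.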

📊 By the numbers

| Metric | Value | Why it matters |
| --- | --- | --- |
| Test functions | 929 | High-granularity validation across parsing, discovery, and security layers |
| Source code | 11,422 LOC | Non-trivial codebase that reflects real architectural scope |
| Test code | 12,927 LOC | ~1.13x ratio with source: disciplined testing, not excess |
| Engine adapters | 4 | Proven multi-engine support: MkDocs, Docusaurus v3, Zensical, Vanilla |
| Runtime dependencies | 5 | Minimal surface area, lower supply chain risk |
| Subprocess calls | 0 | Safe in sandboxed CI and restricted environments |

On a mid-range CI runner, Zenzic scans 5,000 synthetic files in under a second, single-threaded. The benchmark script is included in the repo — run it yourself with python scripts/benchmark.py --files 5000.

⚠️ Breaking changes

This is a Release Candidate from an alpha series — breaking changes are intentional, not accidental:

zenzic serve removed. Use your engine's native command: mkdocs serve, npx docusaurus start.
MkDocs plugin relocated. From zenzic.plugin to zenzic.integrations.mkdocs.
ExclusionManager is mandatory. No more Optional[ExclusionManager] on scanner/validator entry points.

🏁 Run it against your docs

If your documentation is part of your build pipeline, it deserves the same validation rigour as your source code.

pip install --pre zenzic

# Let Zenzic auto-detect your engine
zenzic lint

# Or specify explicitly
zenzic lint --engine docusaurus
zenzic lint --engine mkdocs