DEV Community

PythonWoods

Hardening the Documentation Pipeline: Why I Built a Security-First Markdown Analyzer in Pure Python

🛡️ Beyond Broken Links: The Architecture of Zenzic "The Sentinel"

Documentation is often the weakest link in the CI/CD security chain. We protect our code with linters, SAST, and DAST, but our Markdown files—containing architecture diagrams, setup guides, and snippets—often go unchecked.

I spent the last few months building Zenzic, a deterministic static analysis framework for Markdown sources. We just released v0.5.0a4 "The Sentinel", and I want to share the architectural choices behind it.

⚓ The Core Philosophy: "Lint the Source, not the Build"

Most documentation tools analyze the generated HTML. This creates a "build driver dependency": if your generator (MkDocs, Hugo, Docusaurus) has a bug or an unstable update, your security validation fails.

Zenzic takes a different path. It analyzes the raw Markdown source before the build starts, using a Virtual Site Map (VSM).
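To make the idea concrete, here is a minimal sketch of what a source-level site map could look like. It is not Zenzic's actual VSM: the names `build_site_map` and `broken_links`, the route convention, and the link regex are all assumptions for illustration. The point is only that relative links can be resolved against a map built from the Markdown sources, with no generator involved.

```python
import posixpath
import re

# Hypothetical sketch: Zenzic's real VSM is not shown in this post.
# Matches inline Markdown links like [text](target) and captures the target.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def build_site_map(sources):
    """Map each Markdown source path to the route a generator would likely serve."""
    site_map = {}
    for path in sources:
        route = "/" + path[: -len(".md")]  # assumes every key ends in ".md"
        if route.endswith("/index"):
            route = route[: -len("index")]  # "guide/index.md" -> "/guide/"
        site_map[path] = route
    return site_map


def broken_links(sources):
    """Return (source, target) pairs whose relative target is not in the map."""
    site_map = build_site_map(sources)
    broken = []
    for path, text in sources.items():
        base = posixpath.dirname(path)
        for target in LINK_RE.findall(text):
            if "://" in target or target.startswith("#"):
                continue  # external URLs and same-page anchors are out of scope here
            file_part = target.split("#")[0]  # drop any trailing anchor
            resolved = posixpath.normpath(posixpath.join(base, file_part))
            if resolved not in site_map:
                broken.append((path, target))
    return broken


docs = {
    "index.md": "See the [guide](guide/setup.md) and the [FAQ](faq.md).",
    "guide/setup.md": "Back to [home](../index.md).",
}
print(broken_links(docs))
```

Because the map is derived from the sources alone, the check gives the same answer regardless of which generator version later builds the site.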

🩸 1. The "Blood Sentinel": Classifying Intent

A broken link is a maintenance issue. A link that probes the host OS is a security incident.
I implemented a classification engine that detects whether a resolved path targets sensitive OS directories such as /etc/, /proc/, or /var/.

Instead of a generic error, Zenzic triggers a dedicated Exit Code 3. This is crucial for preventing accidental leakage of infrastructure details or template injection probes in automated pipelines.
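A rough sketch of this kind of classification, under stated assumptions: the prefix list, the function names, and the use of 0 for a clean run are illustrative, and only Exit Code 3 comes from the description above. The key step is normalizing the path before classifying it, so that `..` traversal cannot hide the real destination.

```python
import posixpath

# Assumption: illustrative prefix list and names; only Exit Code 3 comes from the post.
SENSITIVE_ROOTS = ("/etc", "/proc", "/var")
EXIT_OK = 0
EXIT_SECURITY_INCIDENT = 3


def resolve(docs_root, source, target):
    """Resolve a link target relative to its source file, collapsing '..' segments."""
    base = posixpath.dirname(posixpath.join(docs_root, source))
    return posixpath.normpath(posixpath.join(base, target))


def is_sensitive(resolved):
    """True when the resolved path is a sensitive root or lies beneath one."""
    return any(
        resolved == root or resolved.startswith(root + "/")
        for root in SENSITIVE_ROOTS
    )


def exit_code_for(docs_root, links):
    """Return 3 if any (source, target) link resolves into a sensitive directory."""
    for source, target in links:
        if is_sensitive(resolve(docs_root, source, target)):
            return EXIT_SECURITY_INCIDENT
    return EXIT_OK


links = [
    ("index.md", "guide/setup.md"),
    ("guide/setup.md", "../../../../etc/passwd"),
]
print(exit_code_for("/srv/docs", links))
```

Matching on `root + "/"` rather than a bare prefix avoids flagging unrelated paths such as `/etcetera/notes.md`.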

🔐 2. The Shield: Multi-Stream Credential Scanning

Documentation is a magnet for "temporary" credentials that end up being permanent.
Zenzic's Shield scans every line and fenced code block for 8 families of secrets, including:

  • AWS, GitHub, and Stripe keys.
  • Hex-encoded payloads: a detector for \xNN escape sequences catches obfuscated strings.

A detected credential is a build-blocking event: Zenzic exits with Exit Code 2.
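As an illustration of line-by-line scanning, here is a small sketch. The patterns are simplified approximations of common token shapes, not Zenzic's real rules, and the names `PATTERNS`, `scan`, and `exit_code_for` are hypothetical. Only Exit Code 2 and the \xNN idea come from the description above. The AWS value is the placeholder key from AWS's own documentation.

```python
import re

# Assumption: simplified, illustrative patterns; the Shield's real rules are not shown here.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "stripe_live_key": re.compile(r"\bsk_live_[A-Za-z0-9]{24,}\b"),
    # Four or more consecutive \xNN escapes, a common way to obfuscate a string.
    "hex_escape_run": re.compile(r"(?:\\x[0-9A-Fa-f]{2}){4,}"),
}
EXIT_CREDENTIAL = 2


def scan(text):
    """Return (line_number, family) for every hit, in prose and code fences alike."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for family, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, family))
    return findings


def exit_code_for(text):
    """Return 2 when at least one credential-like string is found, else 0."""
    return EXIT_CREDENTIAL if scan(text) else 0


FENCE = "`" * 3  # built at runtime so this example does not nest code fences
sample = "\n".join([
    "Export the key before running:",
    FENCE + "bash",
    "export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE",
    FENCE,
    r"payload = '\x63\x75\x72\x6c'",
])
print(scan(sample), exit_code_for(sample))
```

Scanning every line without tracking fence state is a deliberate simplification: a secret is just as leaked inside a code block as outside one.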

🌀 3. Graph Integrity and Θ(V+E) Complexity

In large documentation sets (10k+ pages), link cycles are common. To ensure Zenzic scales without hitting recursion limits or falling into infinite loops, I implemented an iterative DFS (Depth-First Search) with a three-color marking system.

Pre-computing the cycle registry in Phase 1.5 takes a single Θ(V+E) pass, where V is the number of pages and E the number of links, and lets Phase 2 (Validation) answer each cycle query in O(1). This ensures that even massive docsets are validated in seconds.
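The traversal can be sketched as follows. This is a generic iterative three-color DFS, not Zenzic's code: the function name and the choice to store cycle-closing links (back edges) in the registry are assumptions. Each page is pushed once and each link is examined once, which is where the linear bound comes from.

```python
WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored


def build_cycle_registry(graph):
    """Return the set of (source, target) links that close a cycle (DFS back edges).

    Assumption: `graph` maps each page to a list of pages it links to.
    """
    color = {node: WHITE for node in graph}
    registry = set()
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph[root]))]  # explicit stack replaces the call stack
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                state = color.get(nxt, BLACK)  # unknown targets are treated as leaves
                if state == GRAY:
                    registry.add((node, nxt))  # back edge: nxt is still on the path
                elif state == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    break  # descend; the iterator remembers where to resume
            else:
                color[node] = BLACK  # all outgoing links explored
                stack.pop()
    return registry


pages = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
registry = build_cycle_registry(pages)
print(("c", "a") in registry)  # set membership is the constant-time lookup
```

The graph contains a cycle exactly when the registry is non-empty, and because the stack lives on the heap, a chain of thousands of pages does not hit Python's recursion limit.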

🇮🇹 4. Dogfooding i18n

We believe in bilingual documentation. Zenzic supports native i18n with "Ghost Routes"—logical paths that don't exist on disk but are resolved by build plugins. We dogfood this by keeping our own documentation in full parity between English and Italian.


🚀 Performance and Portability

By enforcing a "No Subprocesses" rule, Zenzic is 100% Pure Python. It’s safe to run in restricted or non-privileged container environments, making it a perfect fit for modern GitOps workflows.

[Image: Zenzic Sentinel Report]

🏁 Join the "Red Team"

Zenzic is open-source and currently in Alpha 4. We are looking for technical feedback on our VSM logic and security patterns. Can you bypass our Shield? Can you break our link resolver?

"The Code is Law. The Documentation is Truth. The Sentinel is vigilant." 🛡️⚓
