<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PythonWoods</title>
    <description>The latest articles on DEV Community by PythonWoods (@pythonwoods).</description>
    <link>https://dev.to/pythonwoods</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3868452%2Fd98e5149-60d4-4406-bdba-7fe6ecca5adb.png</url>
      <title>DEV Community: PythonWoods</title>
      <link>https://dev.to/pythonwoods</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pythonwoods"/>
    <language>en</language>
    <item>
      <title>AI Red Team Attacks Code Linter: Full Post-Mortem Report</title>
      <dc:creator>PythonWoods</dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:52:33 +0000</pubDate>
      <link>https://dev.to/pythonwoods/we-put-our-documentation-linter-under-an-ai-driven-siege-heres-the-post-mortem-2edj</link>
      <guid>https://dev.to/pythonwoods/we-put-our-documentation-linter-under-an-ai-driven-siege-heres-the-post-mortem-2edj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9i9aydb1pkjokvbvqd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9i9aydb1pkjokvbvqd8.png" alt="Operation Obsidian Stress — AI agents attacking the Zenzic Shield" width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/pythonwoods/hardening-the-documentation-pipeline-why-i-built-a-security-first-markdown-analyzer-in-pure-python-37h8"&gt;Part 1&lt;/a&gt;, I explained &lt;strong&gt;why&lt;/strong&gt; I built Zenzic — the philosophy, the threat model, and the architecture of a Pure Python documentation analyzer.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/pythonwoods/your-docs-pipeline-is-a-security-risk-zenzic-v061rc1-fixes-that-3ag3"&gt;Part 2&lt;/a&gt;, I detailed the transition to the &lt;strong&gt;Obsidian Bastion&lt;/strong&gt; architecture: engine-agnostic discovery, the Layered Exclusion Manager, and zero-subprocess enforcement.&lt;/p&gt;

&lt;p&gt;Today, in the final chapter of this series, I'm sharing the results of &lt;strong&gt;Operation Obsidian Stress&lt;/strong&gt;: a controlled adversarial audit where I orchestrated a multi-agent AI system to find every gap in the Shield before the v0.6.1rc2 release.&lt;/p&gt;




&lt;p&gt;Four bypass vectors. Four real findings. All closed.&lt;/p&gt;

&lt;p&gt;This is the complete technical post-mortem of &lt;strong&gt;Operation Obsidian Stress&lt;/strong&gt; — the adversarial security audit I ran against Zenzic v0.6.1rc2's Shield (credential scanner) before release. I'm publishing the full technical details because the findings are instructive, the fixes are non-obvious, and the code belongs in the open.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on methodology:&lt;/strong&gt; To validate the Shield, I orchestrated a multi-team AI system — Red Team, Blue Team, and Purple Team — using specialized agent ensembles to simulate advanced obfuscation techniques. This is AI-assisted security engineering: using the same agentic architecture that attackers use to find the gaps they would exploit. All findings, bypass vectors, and fixes documented here are real.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Shield Is (and Why Breaking It Matters)
&lt;/h2&gt;

&lt;p&gt;Before the attack details, context: &lt;strong&gt;Shield&lt;/strong&gt; is Zenzic's credential detection layer. It scans every Markdown and MDX file in your documentation before the build runs, looking for patterns that indicate real credentials in content.&lt;/p&gt;

&lt;p&gt;The threat model is simple: a contributor submits a PR with a code example. That example contains a real API key — copied from a local terminal session, pasted from a Slack thread, or forgotten after a debugging session. The reviewer reads the prose, not the bytes. The PR merges. The docs build. The key is now live on your documentation site, indexed by search engines.&lt;/p&gt;

&lt;p&gt;Shield exists to catch that before it ships.&lt;/p&gt;

&lt;p&gt;If Shield can be bypassed by someone who knows how it works, it's not a scanner — it's a false guarantee.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Attack Surface
&lt;/h2&gt;

&lt;p&gt;Shield's architecture before Operation Obsidian Stress:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read each line of the Markdown/MDX file&lt;/li&gt;
&lt;li&gt;Apply a normalization pass (strip backticks, collapse whitespace)&lt;/li&gt;
&lt;li&gt;Run 9 regex patterns against the normalized line&lt;/li&gt;
&lt;li&gt;Report any match as a ShieldFinding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Any match in step 4 triggers Exit Code 2 (Shield breach) — non-bypassable, distinct from Exit Code 1 (validation failure) and Exit Code 3 (Blood Sentinel / path traversal).&lt;/p&gt;

&lt;p&gt;The attack surface was step 2: the normalization pass. It normalized formatting noise but did not account for deliberate obfuscation.&lt;/p&gt;
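&lt;p&gt;For illustration, a minimal sketch of what that pre-hardening pass might have looked like (the function name is hypothetical, not Zenzic's actual code):&lt;/p&gt;

```python
import re

_WHITESPACE_RE = re.compile(r"\s+")

def naive_normalize(line):
    """Sketch of a formatting-only pass: strip backticks, collapse whitespace.

    Invisible Unicode, HTML entities, and markup comments pass straight
    through untouched, which is exactly the gap the audit targeted.
    """
    line = line.replace("`", "")
    return _WHITESPACE_RE.sub(" ", line).strip()
```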




&lt;h2&gt;
  
  
  ZRT-006: Unicode Format Character Injection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Input normalization bypass&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; High — complete bypass of all regex patterns&lt;br&gt;&lt;br&gt;
&lt;strong&gt;CVSS analogy:&lt;/strong&gt; 8.1 (High)&lt;/p&gt;
&lt;h3&gt;
  
  
  The Technique
&lt;/h3&gt;

&lt;p&gt;Python's &lt;code&gt;unicodedata&lt;/code&gt; module exposes each character's Unicode general category. The &lt;strong&gt;Cf&lt;/strong&gt; category ("Format") covers characters that are semantically meaningful in Unicode text processing but invisible in rendered output and most text displays:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code Point&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;U+200B&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero Width Space&lt;/td&gt;
&lt;td&gt;Line breaking hint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;U+200C&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero Width Non-Joiner&lt;/td&gt;
&lt;td&gt;Prevents ligatures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;U+200D&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero Width Joiner&lt;/td&gt;
&lt;td&gt;Forces ligatures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;U+00AD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Soft Hyphen&lt;/td&gt;
&lt;td&gt;Optional hyphenation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;U+FEFF&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero Width No-Break Space&lt;/td&gt;
&lt;td&gt;BOM marker&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
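&lt;p&gt;The classification is easy to verify directly against the table above:&lt;/p&gt;

```python
import unicodedata

# All five characters from the table carry the Cf ("Format") category
for ch in ("\u200b", "\u200c", "\u200d", "\u00ad", "\ufeff"):
    assert unicodedata.category(ch) == "Cf"

# Ordinary key characters do not, so a Cf filter leaves them untouched
assert unicodedata.category("s") == "Ll"   # lowercase letter
assert unicodedata.category("-") == "Pd"   # dash punctuation
```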

&lt;p&gt;Inject any of these into a credential token and the regex fails to match:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Craft the bypass
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;

&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-abc123def456ghi789jkl012mno345pqr678stu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# Insert ZWS after position 9 (inside the token)
&lt;/span&gt;&lt;span class="n"&gt;bypass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u200B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bypass&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# 50 chars — 1 more than the real key
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bypass&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# 'sk-abc123\u200Bdef456ghi789jkl012mno345pqr678stu'
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-[a-zA-Z0-9]{48}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bypass&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# None — bypass confirmed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The zero-width space is not in &lt;code&gt;[a-zA-Z0-9]&lt;/code&gt;, so the quantifier never sees 48 consecutive allowed characters: the run is cut six characters after the prefix by a character no human reviewer can see. The credential leaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;Strip all Cf-category characters before any normalization step runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_unicode_format_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Remove all Unicode Format (Cf) characters.

    These are invisible to human readers but can be used to interrupt
    regex pattern matching against credential tokens.

    Examples: U+200B (zero-width space), U+200C (ZWNJ), U+200D (ZWJ),
              U+00AD (soft hyphen), U+FEFF (BOM).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;category&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
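&lt;p&gt;A quick standalone check (re-declaring the helper, with an illustrative key) shows the stripped text matches again:&lt;/p&gt;

```python
import re
import unicodedata

def strip_cf(text):
    # Same logic as the fix above: drop every Cf-category character
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

key = "sk-" + "a" * 48                      # illustrative OpenAI-shaped key
bypass = key[:9] + "\u200b" + key[9:]       # ZRT-006 payload
pattern = re.compile(r"sk-[a-zA-Z0-9]{48}")

assert pattern.search(bypass) is None       # the raw line slips past the regex
assert pattern.search(strip_cf(bypass))     # after stripping, the match returns
```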



&lt;p&gt;&lt;strong&gt;Test coverage added:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u200b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# zero-width space
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u200c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# zero-width non-joiner
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u200d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# zero-width joiner
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u00ad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# soft hyphen
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\ufeff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# zero-width no-break space / BOM
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_shield_cf_strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-abc123def456ghi789jkl012mno345pqr678stu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;bypass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp_path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My API key: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bypass&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_shield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cf char &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; should not bypass Shield&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ZRT-006b: HTML Entity Obfuscation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Input normalization bypass&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; High — bypasses patterns that depend on punctuation characters&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Affected families:&lt;/strong&gt; OpenAI (hyphen), Stripe (hyphen, underscore), GitHub (underscore)&lt;/p&gt;
&lt;h3&gt;
  
  
  The Technique
&lt;/h3&gt;

&lt;p&gt;Markdown renderers decode standard HTML entities. The hyphen character (&lt;code&gt;-&lt;/code&gt;) has the HTML entity &lt;code&gt;&amp;amp;#45;&lt;/code&gt;. The underscore (&lt;code&gt;_&lt;/code&gt;) is &lt;code&gt;&amp;amp;#95;&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;sk&lt;span class="ni"&gt;&amp;amp;#45;&lt;/span&gt;abc123def456ghi789jkl012mno345pqr678stu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Renders as: &lt;code&gt;sk-abc123def456ghi789jkl012mno345pqr678stu901vwx234&lt;/code&gt; — a valid OpenAI key format.&lt;/p&gt;

&lt;p&gt;The credential scanner sees &lt;code&gt;sk&amp;amp;#45;abc123...&lt;/code&gt; — which does not match &lt;code&gt;sk-[a-zA-Z0-9]{48}&lt;/code&gt;. The entity substitutes the single character that forms the structural boundary of the pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_decode_html_entities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Decode HTML entities before pattern matching.

    A credential containing &amp;amp;#45; (hyphen) or &amp;amp;#95; (underscore) renders
    correctly in a browser but bypasses regex patterns that match on the
    literal character.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unescape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;html.unescape()&lt;/code&gt; is part of the Python standard library. No dependencies. Zero cost.&lt;/p&gt;
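&lt;p&gt;A minimal check of the behavior, with an illustrative key (the ampersand is built via &lt;code&gt;chr(38)&lt;/code&gt; to keep this sample markup-safe):&lt;/p&gt;

```python
import html
import re

pattern = re.compile(r"sk-[a-zA-Z0-9]{48}")
# chr(38) is the ampersand: the hyphen hides behind its numeric entity
obfuscated = "sk" + chr(38) + "#45;" + "a" * 48

assert pattern.search(obfuscated) is None           # raw source: no match
assert pattern.search(html.unescape(obfuscated))    # decoded: match restored
```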

&lt;p&gt;&lt;strong&gt;Affected patterns if left unpatched:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sk-...&lt;/code&gt; (OpenAI): hyphen obfuscated as &lt;code&gt;&amp;amp;#45;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sk_live_...&lt;/code&gt; (Stripe): underscores obfuscated as &lt;code&gt;&amp;amp;#95;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ghp_...&lt;/code&gt; (GitHub): underscore in prefix obfuscated&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ZRT-007: Comment Interleaving
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Token fragmentation via markup&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; High — renders the token non-contiguous in raw source&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Technique:&lt;/strong&gt; Inject HTML or MDX comment blocks between credential characters&lt;/p&gt;
&lt;h3&gt;
  
  
  The Technique
&lt;/h3&gt;

&lt;p&gt;HTML comments and MDX expression comments are invisible in rendered output: Markdown passes HTML comments through as raw HTML, which the browser discards, and the MDX compiler strips expression comments at build time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;sk-abc123&lt;span class="c"&gt;&amp;lt;!-- This is a comment, nothing to see here --&amp;gt;&lt;/span&gt;def456ghi789jkl012mno345pqr678stu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the rendered documentation: &lt;code&gt;sk-abc123def456ghi789jkl012mno345pqr678stu901vwx234&lt;/code&gt; (fully readable, valid pattern).&lt;/p&gt;

&lt;p&gt;In the raw source the scanner reads: &lt;code&gt;sk-abc123&amp;lt;!-- ... --&amp;gt;def456ghi789...&lt;/code&gt; — the regex match fails because the comment block interrupts the character class &lt;code&gt;[a-zA-Z0-9]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;MDX variant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sk-abc123{/* inline MDX comment */}def456ghi789jkl012mno345pqr678stu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same effect. Both comment syntaxes are invisible in render, structurally disruptive in raw source.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Pre-compile: these run against every line of every scanned file
&lt;/span&gt;&lt;span class="n"&gt;_HTML_COMMENT_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!--.*?--&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_MDX_COMMENT_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\{/\*.*?\*/\}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_markup_comments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Strip HTML and MDX comments before pattern matching.

    Comments are invisible in rendered output and can be used to fragment
    credential tokens in raw Markdown/MDX source.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_HTML_COMMENT_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_MDX_COMMENT_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note on &lt;code&gt;re.DOTALL&lt;/code&gt;:&lt;/strong&gt; The DOTALL flag makes &lt;code&gt;.&lt;/code&gt; match newlines, so a comment that spans multiple lines (unusual in this attack vector, but possible) is also caught. Because the scanner works on one buffer at a time, DOTALL applies within the buffer being matched, not across the entire file.&lt;/p&gt;
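&lt;p&gt;Verifying the technique end-to-end with an illustrative fragment (the angle brackets are built with &lt;code&gt;chr&lt;/code&gt; to keep the sample markup-safe):&lt;/p&gt;

```python
import re

LT, GT = chr(60), chr(62)   # ASCII 60 and 62: the angle bracket characters
comment = LT + "!-- nothing to see here --" + GT
fragment = "sk-abc123" + comment + "a" * 42

pattern = re.compile(r"sk-[a-zA-Z0-9]{48}")
html_comment_re = re.compile(LT + r"!--.*?--" + GT, re.DOTALL)

assert pattern.search(fragment) is None                   # comment splits the token
assert pattern.search(html_comment_re.sub("", fragment))  # stripped: contiguous again
```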




&lt;h2&gt;
  
  
  ZRT-007b: Cross-Line Token Splitting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Architectural bypass — stateless scanner assumption&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; Critical — bypasses all pattern matching with zero obfuscation&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Technique:&lt;/strong&gt; Line break&lt;/p&gt;

&lt;p&gt;This is the most architecturally significant finding. It requires no Unicode tricks, no entity encoding, no markup injection. One line break.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Technique
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Here is my staging key for the integration tests: sk-abc123def456
ghi789jkl012mno345pqr678stu901vwx234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The scanner processes line 1: &lt;code&gt;Here is my staging key for the integration tests: sk-abc123def456&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;No match. The pattern requires 48 characters after &lt;code&gt;sk-&lt;/code&gt;. There are only 12.&lt;/p&gt;

&lt;p&gt;The scanner processes line 2: &lt;code&gt;ghi789jkl012mno345pqr678stu901vwx234yz&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;No match. No &lt;code&gt;sk-&lt;/code&gt; prefix.&lt;/p&gt;

&lt;p&gt;The credential leaks. The split is invisible in rendered output — the two lines render as a single paragraph. All documentation prose wraps at rendering time. A human reader sees the full key. The scanner never does.&lt;/p&gt;
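&lt;p&gt;Before the full implementation, the core of the fix can be shown in a few lines: scan a synthetic window that straddles the line boundary.&lt;/p&gt;

```python
import re

pattern = re.compile(r"sk-[a-zA-Z0-9]{48}")
line1 = "Here is my staging key: sk-abc123def456"
line2 = "ghi789jkl012mno345pqr678stu901vwx234"

assert pattern.search(line1) is None   # only 12 token chars on line 1
assert pattern.search(line2) is None   # no sk- prefix on line 2

# The join zone: tail of the previous line + head of the current line
join_zone = line1[-80:] + line2[:80]
assert pattern.search(join_zone)       # the split token is contiguous here
```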
&lt;h3&gt;
  
  
  The Fix: The Lookback Buffer
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant Line1 as Line N
    participant Buffer as Lookback Buffer (80 chars)
    participant Line2 as Line N+1
    participant Detector as Pattern Detector

    Note over Line1: "sk-abc123def456" (12 chars after prefix)
    Line1-&amp;gt;&amp;gt;Detector: Scan line N → no match
    Line1-&amp;gt;&amp;gt;Buffer: Store tail[-80:]

    Note over Line2: "ghi789jkl012mno345pqr678stu..."
    Line2-&amp;gt;&amp;gt;Detector: Scan line N+1 → no match
    Buffer-&amp;gt;&amp;gt;Detector: join_zone = prev[-80:] + current[:80]
    Note over Detector: Full 48-char token now visible
    Detector--&amp;gt;&amp;gt;Line2: ✅ ShieldFinding: family=openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A stateful generator that maintains context across line boundaries, creating a synthetic overlap zone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_lines_with_lookback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;buffer_width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ShieldFinding&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Scan lines for credentials with cross-line token detection.

    For each line, in addition to scanning the normalized line itself,
    a &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;join zone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is constructed from the tail of the previous line and
    the head of the current line. Any credential split across the line
    boundary will appear as a contiguous token in this synthetic window.

    Args:
        lines: Iterable of (line_number, raw_line) tuples.
        file_path: Path of the file being scanned (for reporting).
        buffer_width: Characters to take from each side of the boundary.
                      Default 80 — calibrated to catch splits at typical
                      prose line lengths without inflating false positives.

    Yields:
        ShieldFinding instances for each unique credential detected.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prev_normalized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;prev_seen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;seen_this_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_line_for_shield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Pass 1: standard per-line scan
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_scan_normalized_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;
            &lt;span class="n"&gt;seen_this_line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Pass 2: cross-line join zone scan
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev_normalized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;join_zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev_normalized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buffer_width&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;buffer_width&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_scan_normalized_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;join_zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_no&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="c1"&gt;# Deduplicate against families already seen on either adjacent line.
&lt;/span&gt;                &lt;span class="c1"&gt;# A finding in the join zone that also matched on the current line
&lt;/span&gt;                &lt;span class="c1"&gt;# would otherwise be reported twice.
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seen_this_line&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prev_seen&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;

        &lt;span class="n"&gt;prev_normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;
        &lt;span class="n"&gt;prev_seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seen_this_line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Buffer Width Calibration
&lt;/h3&gt;

&lt;p&gt;Why 80 characters? The choice reflects the statistical distribution of credential split positions relative to line length.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A credential split is most likely to occur near the end of a prose line that happens to end mid-token.&lt;/li&gt;
&lt;li&gt;Standard terminal width and most documentation editors wrap at 80–120 characters.&lt;/li&gt;
&lt;li&gt;Taking 80 characters from each side of the boundary covers the vast majority of real-world split positions.&lt;/li&gt;
&lt;li&gt;Increasing to 160 would double the join zone size with minimal additional detection coverage but would increase false positive probability for partial pattern fragments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 80-character default can be overridden if scan results show false positives on a specific corpus.&lt;/p&gt;
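&lt;p&gt;The mechanics are easy to see in isolation. The sketch below uses a single stand-in pattern rather than Shield's real pattern set, and shows a token that neither adjacent line matches alone but the join zone reunites:&lt;/p&gt;

```python
import re

# Stand-in pattern: an AWS-style access key ID (AKIA + 16 chars).
PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

def join_zone(prev_line, cur_line, buffer_width=80):
    """Synthetic window over the line boundary: the last buffer_width
    characters of the previous line plus the first buffer_width of the
    current one."""
    return prev_line[-buffer_width:] + cur_line[:buffer_width]

# A credential split by a hard line wrap: neither half matches alone...
prev = "the key is AKIAIOSF"
cur = "ODNN7EXAMPLE and more prose"
assert PATTERN.search(prev) is None
assert PATTERN.search(cur) is None

# ...but the join zone makes it contiguous again.
assert PATTERN.search(join_zone(prev, cur)) is not None
```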

&lt;h3&gt;
  
  
  Performance Impact of the Lookback Buffer
&lt;/h3&gt;

&lt;p&gt;Adding a second pass per line and constructing a join-zone string has measurable but acceptable overhead:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;5,000 files&lt;/th&gt;
&lt;th&gt;10,000 files&lt;/th&gt;
&lt;th&gt;50,000 files&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No lookback (v0.6.0)&lt;/td&gt;
&lt;td&gt;412 ms&lt;/td&gt;
&lt;td&gt;803 ms&lt;/td&gt;
&lt;td&gt;3,891 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;With lookback (v0.6.1)&lt;/td&gt;
&lt;td&gt;626 ms&lt;/td&gt;
&lt;td&gt;1,247 ms&lt;/td&gt;
&lt;td&gt;6,128 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;+52%&lt;/td&gt;
&lt;td&gt;+55%&lt;/td&gt;
&lt;td&gt;+57%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overhead is roughly linear: each file with N lines now performs about 2N additional string slices (one per side of each boundary) and N additional pattern passes. The absolute numbers remain well within acceptable CI latency: a 5,000-file documentation corpus completes in 626 ms on a mid-range runner.&lt;/p&gt;

&lt;p&gt;The benchmark script is in the repository: &lt;code&gt;python scripts/benchmark.py --files 5000 --mode lookback&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete 8-Step Normalization Pipeline
&lt;/h2&gt;

&lt;p&gt;After closing all four vectors, Shield's normalization function runs every line through a deterministic eight-step sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_line_for_shield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Apply the full normalization pipeline before credential pattern matching.

    Steps are ordered to guarantee that later transformations operate on
    clean input — e.g., entity decoding happens before comment stripping
    to handle entities within comment boundaries.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_line&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 1: Strip Unicode Format (Cf) characters
&lt;/span&gt;    &lt;span class="c1"&gt;# Must run first — prevents Cf chars from surviving entity decoding.
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_strip_unicode_format_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Decode HTML entities
&lt;/span&gt;    &lt;span class="c1"&gt;# &amp;amp;#45; → -,  &amp;amp;#95; → _,  &amp;amp;amp; → &amp;amp;, etc.
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unescape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Strip HTML comments
&lt;/span&gt;    &lt;span class="c1"&gt;# &amp;lt;!-- ... --&amp;gt; → ""
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_HTML_COMMENT_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Strip MDX expression comments
&lt;/span&gt;    &lt;span class="c1"&gt;# {/* ... */} → ""
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_MDX_COMMENT_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 5: Unwrap backtick code spans
&lt;/span&gt;    &lt;span class="c1"&gt;# `sk-abc123...` → sk-abc123...
&lt;/span&gt;    &lt;span class="c1"&gt;# Credentials in code spans are still credentials.
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_BACKTICK_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 6: Remove string concatenation operators
&lt;/span&gt;    &lt;span class="c1"&gt;# "sk-" + "abc123..." → "sk-" "abc123..."
&lt;/span&gt;    &lt;span class="c1"&gt;# Then whitespace collapse in step 8 joins them for matching.
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 7: Replace Markdown table cell separators
&lt;/span&gt;    &lt;span class="c1"&gt;# | key | value | → " key  value "
&lt;/span&gt;    &lt;span class="c1"&gt;# Prevents pipe characters from interrupting patterns.
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 8: Collapse whitespace
&lt;/span&gt;    &lt;span class="c1"&gt;# Multiple spaces → single space, strip leading/trailing
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step is independently testable. The test suite includes 47 tests specifically for normalization, covering each step in isolation and in combination.&lt;/p&gt;
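&lt;p&gt;To make the "independently testable" claim concrete, here is a sketch of what per-step tests can look like. The helper names are illustrative, not Zenzic's actual test suite; steps 5 and 8 are reproduced from the listing above:&lt;/p&gt;

```python
import re

# Step 5 from the listing above: unwrap backtick code spans.
BACKTICK_RE = re.compile(r"`([^`]+)`")

def unwrap_code_spans(text):
    return BACKTICK_RE.sub(lambda m: m.group(1), text)

# Step 8: collapse runs of whitespace, strip the ends.
def collapse_whitespace(text):
    return " ".join(text.split())

# Each step in isolation...
assert unwrap_code_spans("use `sk-abc123` here") == "use sk-abc123 here"
assert collapse_whitespace("  a   b  ") == "a b"

# ...and in combination, in pipeline order.
assert collapse_whitespace(unwrap_code_spans(" `a`   `b` ")) == "a b"
```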




&lt;h2&gt;
  
  
  Coverage Added by Operation Obsidian Stress
&lt;/h2&gt;

&lt;p&gt;Before the operation: &lt;strong&gt;929 passing tests&lt;/strong&gt;.&lt;br&gt;
After closing all four vectors: &lt;strong&gt;1,046 passing tests&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;117 new tests, distributed across:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;New Tests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cf character injection (ZRT-006)&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTML entity obfuscation (ZRT-006b)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comment interleaving (ZRT-007)&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-line token splitting (ZRT-007b)&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization pipeline integration&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Shield Detects
&lt;/h2&gt;

&lt;p&gt;Shield detects nine credential families, all validated against the complete normalization pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Example true positive&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI API Key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sk-[a-zA-Z0-9]{48}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sk-abc123def456ghi789...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Token&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gh[pousr]_[A-Za-z0-9_]+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ghp_abc123def456&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Access Key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AKIA[0-9A-Z]{16}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AKIAIOSFODNN7EXAMPLE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stripe Live Key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sk_live_[a-zA-Z0-9]+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sk_live_abc123def456&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slack Token&lt;/td&gt;
&lt;td&gt;&lt;code&gt;xox[bpas]-[0-9]+-...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;xoxb-12345-67890-abc&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google API Key&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AIza[0-9A-Za-z\-_]{35}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AIzaSyD-9tSrke72I6e0...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private Key Block&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-----BEGIN .* PRIVATE KEY-----&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PEM headers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hex-Encoded Payload&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(\\x[0-9a-fA-F]{2}){8,}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;\x41\x42\x43...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitLab PAT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;glpat-[0-9a-zA-Z\-_]{20}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;glpat-xxxxxxxxxxxxxxxxxxxx&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
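&lt;p&gt;A scanner over a family table like this reduces to a single loop. The sketch below uses three patterns abbreviated from the table; Shield's real patterns may differ in anchoring and context checks:&lt;/p&gt;

```python
import re

# (family, pattern) pairs abbreviated from the table above.
FAMILIES = [
    ("aws_access_key", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("stripe_live_key", re.compile(r"sk_live_[a-zA-Z0-9]+")),
    ("gitlab_pat", re.compile(r"glpat-[0-9a-zA-Z\-_]{20}")),
]

def scan(text):
    """Yield the family name of every pattern that matches the text."""
    for family, pattern in FAMILIES:
        if pattern.search(text):
            yield family

line = "key=AKIAIOSFODNN7EXAMPLE token=glpat-" + "x" * 20
assert sorted(scan(line)) == ["aws_access_key", "gitlab_pat"]
```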




&lt;h2&gt;
  
  
  Exit Code Taxonomy
&lt;/h2&gt;

&lt;p&gt;Zenzic reports results through a fixed exit code taxonomy, and the two security codes in it cannot be suppressed by any configuration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Exit Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Clean&lt;/td&gt;
&lt;td&gt;No issues found&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Sentinel&lt;/td&gt;
&lt;td&gt;Validation failures (broken links, orphans, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Shield&lt;/td&gt;
&lt;td&gt;Credential detected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Blood Sentinel&lt;/td&gt;
&lt;td&gt;Path traversal attempt in config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Codes 2 and 3 cannot be configured away. This is intentional: they represent the security perimeter. A CI step that can be silenced on a security failure is not a security control.&lt;/p&gt;
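&lt;p&gt;One way to keep that perimeter honest in code is to make the security codes structurally distinct from the quality codes. A sketch with assumed names, not Zenzic's actual internals:&lt;/p&gt;

```python
from enum import IntEnum

class ExitCode(IntEnum):
    CLEAN = 0           # no issues found
    SENTINEL = 1        # validation failures (links, orphans, ...)
    SHIELD = 2          # credential detected
    BLOOD_SENTINEL = 3  # path traversal attempt in config

def final_exit_code(code, quiet=False):
    """A hypothetical quiet flag may downgrade quality failures,
    but the security codes pass through untouched."""
    if quiet and code == ExitCode.SENTINEL:
        return ExitCode.CLEAN
    return code

assert final_exit_code(ExitCode.SENTINEL, quiet=True) == ExitCode.CLEAN
assert final_exit_code(ExitCode.SHIELD, quiet=True) == ExitCode.SHIELD
```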




&lt;h2&gt;
  
  
  CI Integration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/docs.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Zenzic Shield&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install zenzic==0.6.1rc2&lt;/span&gt;
    &lt;span class="s"&gt;zenzic shield --strict&lt;/span&gt;
  &lt;span class="c1"&gt;# Exit code 2 → credential found → build fails&lt;/span&gt;
  &lt;span class="c1"&gt;# Exit code 3 → path traversal → build fails&lt;/span&gt;
  &lt;span class="c1"&gt;# No --ignore-shield flag exists&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pre-commit hook&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;zenzic&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.6.1rc2

&lt;span class="c"&gt;# Full analysis (links + orphans + credentials + assets)&lt;/span&gt;
zenzic check all

&lt;span class="c"&gt;# Security scan only&lt;/span&gt;
zenzic shield

&lt;span class="c"&gt;# Quality score with regression detection&lt;/span&gt;
zenzic score
zenzic diff &lt;span class="nt"&gt;--baseline&lt;/span&gt; .zenzic-baseline.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The four bypass vectors found during Operation Obsidian Stress are not exotic. They're the kind of techniques that appear in any list of regex evasion methods — Unicode injection, HTML entity encoding, markup comment interleaving, structural line splitting.&lt;/p&gt;

&lt;p&gt;What made them findable was the decision to look for them systematically, with adversarial intent, before release. What made them fixable was having a normalization pipeline with defined semantics and comprehensive test coverage at each step.&lt;/p&gt;

&lt;p&gt;Security tooling that isn't tested adversarially is security tooling that provides the appearance of coverage without the substance. The Shield bypass vectors existed for the same reason most security gaps exist: nobody had tried to break through them yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://zenzic.dev" rel="noopener noreferrer"&gt;zenzic.dev&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/PythonWoods/zenzic" rel="noopener noreferrer"&gt;github.com/PythonWoods/zenzic&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/zenzic/" rel="noopener noreferrer"&gt;pypi.org/project/zenzic&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Thanks for reading.&lt;/p&gt;

</description>
      <category>python</category>
      <category>security</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your Docs Pipeline Is a Security Risk — Zenzic v0.6.1rc1 Fixes That</title>
      <dc:creator>PythonWoods</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:43:33 +0000</pubDate>
      <link>https://dev.to/pythonwoods/your-docs-pipeline-is-a-security-risk-zenzic-v061rc1-fixes-that-3ag3</link>
      <guid>https://dev.to/pythonwoods/your-docs-pipeline-is-a-security-risk-zenzic-v061rc1-fixes-that-3ag3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3yy05bbjdzuylso6wen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3yy05bbjdzuylso6wen.png" alt="Zenzic logo — Documentation Security Layer" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🛡️ &lt;strong&gt;UPDATE (2026-04-16):&lt;/strong&gt; Zenzic has evolved! v0.6.1rc2 "Obsidian Bastion" is now live with enhanced Shield hardening and full Docusaurus v3 support. Visit the official documentation at &lt;a href="https://zenzic.dev" rel="noopener noreferrer"&gt;zenzic.dev&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most documentation pipelines trust Markdown blindly. Unvalidated links, hidden credential leaks, path traversal risks, engine-specific blind spots — all of this happens &lt;strong&gt;before your build system even knows something is wrong&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Zenzic exists to close that gap.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/pythonwoods/hardening-the-documentation-pipeline-why-i-built-a-security-first-markdown-analyzer-in-pure-python-37h8"&gt;Part 1&lt;/a&gt;, I explained &lt;strong&gt;why&lt;/strong&gt; I built it — the philosophy, the threat model, the architecture of a Pure Python analyzer that lints raw Markdown sources before any build engine touches them.&lt;/p&gt;

&lt;p&gt;Today, &lt;strong&gt;v0.6.1rc1 "Obsidian Bastion"&lt;/strong&gt; turns that idea into something much bigger: not just a linter, but a &lt;strong&gt;security layer for any Markdown-based documentation stack&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Where Zenzic fits
&lt;/h2&gt;

&lt;p&gt;If your documentation is part of your CI pipeline, it's part of your attack surface.&lt;/p&gt;

&lt;p&gt;Zenzic is designed for CI pipelines that handle untrusted docs, open-source projects with external contributors, teams running multiple doc engines side by side, and security-conscious workflows that need to validate content &lt;em&gt;before&lt;/em&gt; the build — not after. Most tools in this space are engine-specific, runtime-dependent, or rely on shelling out to external processes. Zenzic is none of these.&lt;/p&gt;

&lt;p&gt;Three core properties define it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No subprocess execution — ever.&lt;/strong&gt; No &lt;code&gt;node&lt;/code&gt;, no &lt;code&gt;git&lt;/code&gt;, no shell calls. The core library is 100% Pure Python. This isn't a convenience feature — it's a security model. A tool that spawns subprocesses is a tool that can be tricked into executing untrusted code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engine-agnostic analysis.&lt;/strong&gt; Zenzic reads raw Markdown and configuration files as plain data. It never imports or executes a documentation framework. Engine-specific knowledge lives in thin, replaceable adapters that translate semantics into a neutral protocol. The core sees only a &lt;code&gt;BaseAdapter&lt;/code&gt; — it doesn't know whether you run MkDocs, Docusaurus, or something that doesn't exist yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic file discovery.&lt;/strong&gt; Every file scan is explicit. Every path is validated. There are no accidental full-repo traversals, no hidden directories slipping through. Identical source files always produce identical results.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏛️ From linter to platform
&lt;/h2&gt;

&lt;p&gt;When I wrote Part 1, Zenzic was &lt;strong&gt;The Sentinel&lt;/strong&gt; — a capable linter with MkDocs awareness. It could find broken links, detect credentials, and catch orphaned pages. But it had a blind spot: it could only see one documentation ecosystem.&lt;/p&gt;

&lt;p&gt;The 0.6.x series was about removing that limitation entirely. The goal was to build a &lt;strong&gt;documentation security layer&lt;/strong&gt;, not a plugin.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Codename&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v0.5.x&lt;/td&gt;
&lt;td&gt;The Sentinel&lt;/td&gt;
&lt;td&gt;Core scanning + MkDocs awareness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.6.0&lt;/td&gt;
&lt;td&gt;Obsidian Glass&lt;/td&gt;
&lt;td&gt;Headless architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.6.1rc1&lt;/td&gt;
&lt;td&gt;Obsidian Bastion&lt;/td&gt;
&lt;td&gt;Platform baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest single commit in this arc deleted &lt;strong&gt;21,870 lines&lt;/strong&gt; and added 888. That was the Headless Architecture transition: Zenzic stopped being a MkDocs tool and became an &lt;strong&gt;Analyzer of Documentation Platforms&lt;/strong&gt;. The documentation site itself was separated into its own Docusaurus-powered repository — and Zenzic now validates it using the same engine-agnostic machinery it offers to everyone else.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚛️ Parsing Docusaurus without Node
&lt;/h2&gt;

&lt;p&gt;The first concrete challenge was supporting Docusaurus v3. Its config files are TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;presets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;classic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;routeBasePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/guides&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}]],&lt;/span&gt;
  &lt;span class="na"&gt;i18n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;defaultLocale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;locales&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;it&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The obvious solution — calling &lt;code&gt;node&lt;/code&gt; to evaluate the config — would violate Pillar 2 (No Subprocesses). So I built a &lt;strong&gt;static parser in Pure Python&lt;/strong&gt; that extracts &lt;code&gt;baseUrl&lt;/code&gt;, &lt;code&gt;routeBasePath&lt;/code&gt;, locale configuration, and plugin metadata directly from the source text. No evaluation. No runtime. No JavaScript.&lt;/p&gt;

&lt;p&gt;The adapter handles &lt;code&gt;.md&lt;/code&gt; and &lt;code&gt;.mdx&lt;/code&gt; sources, frontmatter &lt;code&gt;slug:&lt;/code&gt; resolution (absolute and relative), &lt;code&gt;_&lt;/code&gt;-prefixed exclusion (Docusaurus convention), auto-generated sidebar mode, and full i18n locale tree discovery. When it encounters dynamic config patterns (&lt;code&gt;async&lt;/code&gt;, &lt;code&gt;import()&lt;/code&gt;, &lt;code&gt;require()&lt;/code&gt;), it falls back gracefully instead of crashing.&lt;/p&gt;
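&lt;p&gt;To give a flavor of the approach, here is a toy regex-based extractor for &lt;code&gt;routeBasePath&lt;/code&gt;. It illustrates static extraction only; Zenzic's parser additionally handles nesting, quoting variants, and fallbacks:&lt;/p&gt;

```python
import re

# Matches routeBasePath: '/guides' or routeBasePath: "guides"
# anywhere in the config source, without evaluating it.
ROUTE_BASE_RE = re.compile(r"""routeBasePath\s*:\s*['"]([^'"]+)['"]""")

def extract_route_base_path(config_source):
    match = ROUTE_BASE_RE.search(config_source)
    return match.group(1) if match else None

ts = "presets: [['classic', { docs: { routeBasePath: '/guides' } }]],"
assert extract_route_base_path(ts) == "/guides"
assert extract_route_base_path("export default {};") is None
```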

&lt;p&gt;This matters beyond Docusaurus. It proves that Zenzic's Pure Python core can secure a JavaScript-based documentation stack with zero Node.js dependencies. &lt;strong&gt;65 tests&lt;/strong&gt; validate the adapter across 12 test classes.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 Layered Exclusion — the real headline feature
&lt;/h2&gt;

&lt;p&gt;File discovery is where most documentation tools quietly fail. A scanner that recursively walks every directory will eventually read inside &lt;code&gt;.git/&lt;/code&gt;, &lt;code&gt;node_modules/&lt;/code&gt;, or &lt;code&gt;__pycache__/&lt;/code&gt;. In the best case, this is slow. In the worst case, it's a security incident.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Layered Exclusion Manager&lt;/strong&gt; replaces all ad-hoc directory filtering in Zenzic with a deterministic 4-level hierarchy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1&lt;/td&gt;
&lt;td&gt;System guardrails&lt;/td&gt;
&lt;td&gt;Immutable — &lt;code&gt;.git&lt;/code&gt;, &lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;__pycache__&lt;/code&gt;, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.gitignore&lt;/code&gt; + forced inclusions&lt;/td&gt;
&lt;td&gt;Additive rules, parsed in Pure Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3&lt;/td&gt;
&lt;td&gt;Config (&lt;code&gt;zenzic.toml&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;excluded_dirs&lt;/code&gt; / &lt;code&gt;excluded_file_patterns&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L4&lt;/td&gt;
&lt;td&gt;CLI flags&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--exclude-dir&lt;/code&gt; / &lt;code&gt;--include-dir&lt;/code&gt; at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The levels are not just a convenience API — they encode a security invariant. L1 System Guardrails are immutable: no configuration file and no CLI flag can force Zenzic to scan inside &lt;code&gt;.git/&lt;/code&gt; or &lt;code&gt;node_modules/&lt;/code&gt;. This is a deliberate architectural decision. A tool that can be configured to read arbitrary system directories is a tool that can be weaponized.&lt;/p&gt;

&lt;p&gt;At L2, &lt;code&gt;.gitignore&lt;/code&gt; is interpreted by a built-in &lt;strong&gt;VCS Ignore Parser&lt;/strong&gt; — a Pure Python &lt;code&gt;.gitignore&lt;/code&gt; interpreter with pre-compiled regex patterns. No calls to &lt;code&gt;git check-ignore&lt;/code&gt;. No subprocess.&lt;/p&gt;
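&lt;p&gt;The core trick of such an interpreter is pre-compiling glob rules to regexes once, then matching paths against the compiled set. A deliberately simplified sketch using &lt;code&gt;fnmatch&lt;/code&gt; (real &lt;code&gt;.gitignore&lt;/code&gt; semantics — negation, anchoring, directory-only rules — need considerably more):&lt;/p&gt;

```python
import fnmatch
import re

def compile_ignore_patterns(lines):
    """Pre-compile simple glob rules to regexes, skipping blanks
    and comments."""
    patterns = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(fnmatch.translate(line)))
    return patterns

def is_ignored(name, patterns):
    return any(p.match(name) for p in patterns)

rules = compile_ignore_patterns(["*.pyc", "# a comment", "build"])
assert is_ignored("module.pyc", rules)
assert is_ignored("build", rules)
assert not is_ignored("readme.md", rules)
```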

&lt;p&gt;At L4, a CI operator can &lt;code&gt;--include-dir vendor/critical-patch/&lt;/code&gt; without touching config files, or &lt;code&gt;--exclude-dir drafts/&lt;/code&gt; for a specific run. The hierarchy is predictable: later levels override earlier ones, with the single exception of the L1 guardrails, which nothing can override.&lt;/p&gt;
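&lt;p&gt;The precedence rules can be sketched as a single predicate. This toy version (names assumed, L2 omitted for brevity) shows the key invariant: the guardrail check runs first and nothing bypasses it:&lt;/p&gt;

```python
# L1: immutable system guardrails. Nothing below can override these.
GUARDRAIL_DIRS = {".git", "node_modules", "__pycache__"}

def is_excluded(path_parts, config_excluded, cli_excluded, cli_included):
    # L1 wins unconditionally.
    if any(part in GUARDRAIL_DIRS for part in path_parts):
        return True
    # L4: CLI flags override config for this run.
    if any(part in cli_included for part in path_parts):
        return False
    if any(part in cli_excluded for part in path_parts):
        return True
    # L3: zenzic.toml-style exclusions.
    return any(part in config_excluded for part in path_parts)

# node_modules stays excluded even when explicitly force-included.
assert is_excluded(("node_modules", "a.md"), set(), set(), {"node_modules"})
# A CLI include beats a config exclusion.
assert not is_excluded(("drafts", "a.md"), {"drafts"}, set(), {"drafts"})
```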




&lt;h2&gt;
  
  
  🗡️ The Tabula Rasa refactor
&lt;/h2&gt;

&lt;p&gt;This was the most invasive change in the entire release arc. I removed &lt;strong&gt;every single &lt;code&gt;rglob()&lt;/code&gt; call&lt;/strong&gt; from the codebase — all of them — and replaced them with two centralized functions in &lt;code&gt;discovery.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;walk_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exclusion_manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;iter_markdown_sources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exclusion_manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;exclusion_manager&lt;/code&gt; parameter is mandatory. Not &lt;code&gt;Optional&lt;/code&gt;, no &lt;code&gt;None&lt;/code&gt; default. If you call a scanner or validator entry point without an &lt;code&gt;ExclusionManager&lt;/code&gt;, you get a &lt;code&gt;TypeError&lt;/code&gt; at call time — not a silent full-tree scan at runtime.&lt;/p&gt;
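&lt;p&gt;A toy reproduction shows why a mandatory parameter fails at the right moment. The names mirror the article, but the bodies are illustrative:&lt;/p&gt;

```python
class ExclusionManager:
    """Stand-in for the real manager; this sketch is not Zenzic's API."""
    def is_excluded(self, path):
        return False

def walk_files(root, exclusion_manager):
    # The second argument is mandatory: no Optional[...], no None default.
    # Forgetting it raises TypeError at the call site instead of silently
    # falling back to a full-tree scan.
    yield from ()  # actual traversal elided in this sketch

try:
    list(walk_files("docs/"))   # caller forgot the manager
except TypeError as exc:
    print("refused:", exc)      # missing required positional argument
```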

&lt;p&gt;&lt;strong&gt;168 call sites&lt;/strong&gt; were updated across 13 test files. The result: accidental full-repo scans are now architecturally impossible. Every traversal is explicit, filtered, and auditable. This eliminates a common source of CI slowdowns and — more importantly — removes a class of security blind spots where sensitive directories could be inadvertently read.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔐 Security hardening
&lt;/h2&gt;

&lt;p&gt;Two targeted fixes closed real attack vectors identified during internal review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReDoS prevention (F2-1).&lt;/strong&gt; Lines exceeding 1 MiB are silently truncated before Shield regex matching. A crafted documentation file with a multi-megabyte line could exploit catastrophic backtracking in credential detection patterns. This is not a theoretical concern — ReDoS is a well-documented attack against input validation layers that use unbounded regex.&lt;/p&gt;
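&lt;p&gt;The mitigation itself is tiny: bound the input before the regex engine ever sees it. A hedged sketch of the idea, using characters as a stand-in for bytes (the 1 MiB constant comes from the article; the function is mine):&lt;/p&gt;

```python
MAX_LINE_LEN = 1024 * 1024   # the 1 MiB cap from fix F2-1
                             # (chars as a proxy for bytes in this sketch)

def bounded(line):
    """Truncate oversized lines before any Shield regex runs on them.

    Backtracking cost grows with input length, so capping the line
    bounds the regex engine's worst case regardless of the pattern.
    """
    return line[:MAX_LINE_LEN]

crafted = "A" * (3 * 1024 * 1024)   # a 3 MiB single-line payload
print(len(bounded(crafted)))        # 1048576
print(bounded("normal line"))       # short lines pass through unchanged
```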

&lt;p&gt;&lt;strong&gt;Path traversal guard (F4-1).&lt;/strong&gt; &lt;code&gt;_validate_docs_root()&lt;/code&gt; now rejects &lt;code&gt;docs_dir&lt;/code&gt; paths that escape the repository root. A malicious &lt;code&gt;zenzic.toml&lt;/code&gt; pointing &lt;code&gt;docs_dir: ../../../etc/&lt;/code&gt; triggers Exit Code 3 (Blood Sentinel) before any file is read. Like the Shield (Exit Code 2), the Blood Sentinel cannot be suppressed or downgraded by any flag. These two non-negotiable exit codes form Zenzic's security perimeter.&lt;/p&gt;
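&lt;p&gt;The traversal check reduces to resolving the configured path and verifying it stays under the repository root. A sketch of the described behaviour (the function body is my illustration, not Zenzic's source; &lt;code&gt;is_relative_to&lt;/code&gt; needs Python 3.9+):&lt;/p&gt;

```python
from pathlib import Path

EXIT_BLOOD_SENTINEL = 3   # non-suppressible, per the article

def validate_docs_root(repo_root, docs_dir):
    """Reject a docs_dir that escapes the repository root (sketch only)."""
    repo = Path(repo_root).resolve()
    target = (repo / docs_dir).resolve()
    if not target.is_relative_to(repo):   # Python 3.9+
        return EXIT_BLOOD_SENTINEL        # refuse before any file is read
    return 0

print(validate_docs_root("/srv/project", "docs"))            # 0
print(validate_docs_root("/srv/project", "../../../etc/"))   # 3
```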




&lt;h2&gt;
  
  
  🏗️ No subprocesses — now enforced, not aspirational
&lt;/h2&gt;

&lt;p&gt;When I started Zenzic, "No Subprocesses" was a design goal. As of this Release Candidate, it is a &lt;strong&gt;verified property&lt;/strong&gt; of the entire codebase.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;zenzic serve&lt;/code&gt; command has been removed entirely — it was the last place where a subprocess could theoretically be spawned. Docusaurus config is parsed as text, not evaluated via Node.js. &lt;code&gt;.gitignore&lt;/code&gt; is interpreted in Pure Python, not via &lt;code&gt;git check-ignore&lt;/code&gt;. The MkDocs plugin has been relocated to &lt;code&gt;zenzic.integrations.mkdocs&lt;/code&gt; and installs separately via &lt;code&gt;pip install "zenzic[mkdocs]"&lt;/code&gt;, keeping the core free of engine-specific imports.&lt;/p&gt;

&lt;p&gt;Zero &lt;code&gt;subprocess.run()&lt;/code&gt;. Zero &lt;code&gt;os.system()&lt;/code&gt;. Zero shell calls. This makes Zenzic safe to run in any container, any sandbox, any restricted CI environment — without granting it any capabilities beyond reading files.&lt;/p&gt;
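&lt;p&gt;A property like this stays verified only if something checks it on every commit. One way to do that, sketched below with names and a token list of my own choosing, is an AST-based guard test that fails if a forbidden call ever reappears:&lt;/p&gt;

```python
import ast

# Illustrative deny-list; a real guard test would scan every module file.
FORBIDDEN = {("subprocess", "run"), ("subprocess", "Popen"), ("os", "system")}

def forbidden_calls(source):
    """Return (module, attr) pairs like subprocess.run found in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            pair = (node.value.id, node.attr)
            if pair in FORBIDDEN:
                found.add(pair)
    return found

clean = "from pathlib import Path\nprint(Path('.').name)\n"
dirty = "import subprocess\nsubprocess.run(['git', 'status'])\n"
print(forbidden_calls(clean))   # set()
print(forbidden_calls(dirty))   # {('subprocess', 'run')}
```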




&lt;h2&gt;
  
  
  📊 By the numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test functions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;929&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-granularity validation across parsing, discovery, and security layers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source code&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11,422 LOC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Non-trivial codebase — reflects real architectural scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test code&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12,927 LOC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1.13x ratio with source — disciplined testing, not excess&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine adapters&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proven multi-engine support: MkDocs, Docusaurus v3, Zensical, Vanilla&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime dependencies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal surface area — lower supply chain risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subprocess calls&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Safe in sandboxed CI and restricted environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On a mid-range CI runner, Zenzic scans 5,000 synthetic files in under a second, single-threaded. The benchmark script is included in the repo — run it yourself with &lt;code&gt;python scripts/benchmark.py --files 5000&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Breaking changes
&lt;/h2&gt;

&lt;p&gt;This is a Release Candidate from an alpha series — breaking changes are intentional, not accidental:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;zenzic serve&lt;/code&gt; removed.&lt;/strong&gt; Use your engine's native command: &lt;code&gt;mkdocs serve&lt;/code&gt;, &lt;code&gt;npx docusaurus start&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MkDocs plugin relocated.&lt;/strong&gt; From &lt;code&gt;zenzic.plugin&lt;/code&gt; to &lt;code&gt;zenzic.integrations.mkdocs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ExclusionManager&lt;/code&gt; is mandatory.&lt;/strong&gt; No more &lt;code&gt;Optional[ExclusionManager]&lt;/code&gt; on scanner/validator entry points.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🏁 Run it against your docs
&lt;/h2&gt;

&lt;p&gt;If your documentation is part of your build pipeline, it deserves the same validation rigour as your source code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; zenzic

&lt;span class="c"&gt;# Let Zenzic auto-detect your engine&lt;/span&gt;
zenzic lint

&lt;span class="c"&gt;# Or specify explicitly&lt;/span&gt;
zenzic lint &lt;span class="nt"&gt;--engine&lt;/span&gt; docusaurus
zenzic lint &lt;span class="nt"&gt;--engine&lt;/span&gt; mkdocs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it on your repo. See what it finds — before your users do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/PythonWoods/zenzic" rel="noopener noreferrer"&gt;github.com/PythonWoods/zenzic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://zenzic.dev" rel="noopener noreferrer"&gt;zenzic.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/zenzic/" rel="noopener noreferrer"&gt;pypi.org/project/zenzic&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Your documentation isn't just content. It's input. Treat it accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"The Bastion holds."&lt;/strong&gt; 🛡️&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Next steps
&lt;/h2&gt;

&lt;p&gt;Thanks for reading.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>security</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Hardening the Documentation Pipeline: Why I Built a Security-First Markdown Analyzer in Pure Python</title>
      <dc:creator>PythonWoods</dc:creator>
      <pubDate>Wed, 08 Apr 2026 19:38:57 +0000</pubDate>
      <link>https://dev.to/pythonwoods/hardening-the-documentation-pipeline-why-i-built-a-security-first-markdown-analyzer-in-pure-python-37h8</link>
      <guid>https://dev.to/pythonwoods/hardening-the-documentation-pipeline-why-i-built-a-security-first-markdown-analyzer-in-pure-python-37h8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3yy05bbjdzuylso6wen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3yy05bbjdzuylso6wen.png" alt="Zenzic logo — Documentation Security Layer" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🛡️ Beyond Broken Links: The Architecture of Zenzic "The Sentinel"
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;🛡️ &lt;strong&gt;UPDATE (2026-04-16):&lt;/strong&gt; Zenzic has evolved! v0.6.1rc2 "Obsidian Bastion" is now live with enhanced Shield hardening and full Docusaurus v3 support. Visit the official documentation at &lt;a href="https://zenzic.dev" rel="noopener noreferrer"&gt;zenzic.dev&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Documentation is often the weakest link in the CI/CD security chain. We protect our code with linters, SAST, and DAST, but our Markdown files—containing architecture diagrams, setup guides, and snippets—often go unchecked.&lt;/p&gt;

&lt;p&gt;I spent the last few months building &lt;strong&gt;Zenzic&lt;/strong&gt;, a deterministic static analysis framework for Markdown sources. We just released &lt;strong&gt;v0.5.0a4 "The Sentinel"&lt;/strong&gt;, and I want to share the architectural choices behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚓ The Core Philosophy: "Lint the Source, not the Build"
&lt;/h2&gt;

&lt;p&gt;Most documentation tools analyze the generated HTML. This creates a "build driver dependency": if your generator (MkDocs, Hugo, Docusaurus) has a bug or an unstable update, your security validation fails. &lt;/p&gt;

&lt;p&gt;Zenzic takes a different path. It analyzes the &lt;strong&gt;raw Markdown source&lt;/strong&gt; before the build starts, using a &lt;strong&gt;Virtual Site Map (VSM)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🩸 1. The "Blood Sentinel": Classifying Intent
&lt;/h3&gt;

&lt;p&gt;A broken link is a maintenance issue. A link that probes the host OS is a &lt;strong&gt;security incident&lt;/strong&gt;.&lt;br&gt;
I implemented a classification engine that detects if a resolved path targets sensitive OS directories (&lt;code&gt;/etc/&lt;/code&gt;, &lt;code&gt;/proc/&lt;/code&gt;, &lt;code&gt;/var/&lt;/code&gt;, etc.). &lt;/p&gt;

&lt;p&gt;Instead of a generic error, Zenzic triggers a dedicated &lt;strong&gt;Exit Code 3&lt;/strong&gt;. This is crucial for preventing accidental leakage of infrastructure details or template injection probes in automated pipelines.&lt;/p&gt;
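&lt;p&gt;The classification step can be pictured as a prefix check against a registry of sensitive roots. A minimal sketch (the directory list follows this post; the rest is illustrative):&lt;/p&gt;

```python
from pathlib import PurePosixPath

SENSITIVE_ROOTS = ("/etc", "/proc", "/var")   # per the article; not exhaustive
EXIT_BLOOD_SENTINEL = 3

def classify_link(resolved_path):
    """Map a resolved link target to an exit code (illustrative sketch)."""
    path = PurePosixPath(resolved_path)
    for root in SENSITIVE_ROOTS:
        if path.is_relative_to(root):   # Python 3.9+
            # Not a broken link: a probe of the host OS is a security incident.
            return EXIT_BLOOD_SENTINEL
    return 0

print(classify_link("/etc/passwd"))      # 3
print(classify_link("/docs/setup.md"))   # 0
```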

&lt;h3&gt;
  
  
  🔐 2. The Shield: Multi-Stream Credential Scanning
&lt;/h3&gt;

&lt;p&gt;Documentation is a magnet for "temporary" credentials that end up being permanent.&lt;br&gt;
Zenzic's &lt;strong&gt;Shield&lt;/strong&gt; scans every line and fenced code block for 8 families of secrets, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  AWS, GitHub, and Stripe keys.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hex-encoded payloads:&lt;/strong&gt; a dedicated detector for &lt;code&gt;\xNN&lt;/code&gt; escape sequences catches obfuscated strings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any confirmed match triggers &lt;strong&gt;Exit Code 2&lt;/strong&gt;: a credential breach is a build-blocking event.&lt;/p&gt;
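&lt;p&gt;The hex-escape detector is the interesting one: long runs of &lt;code&gt;\xNN&lt;/code&gt; sequences are rare in legitimate prose. A hedged sketch of such a check, with a run-length threshold that is my guess rather than the Shield's actual tuning:&lt;/p&gt;

```python
import re

# Flag runs of 8 or more consecutive \xNN escapes. The threshold is an
# illustrative guess, not the Shield's real tuning.
HEX_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2}){8,}")

def has_hex_payload(line):
    """True if the line contains a long run of hex escape sequences."""
    return bool(HEX_RUN.search(line))

obfuscated = r"key = '\x41\x4b\x49\x41\x49\x4f\x53\x46\x4f\x44'"
print(has_hex_payload(obfuscated))          # True: ten escapes in a row
print(has_hex_payload(r"single \x41 doc"))  # False: one escape is fine
```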

&lt;h3&gt;
  
  
  🌀 3. Graph Integrity and Θ(V+E) Complexity
&lt;/h3&gt;

&lt;p&gt;In large documentation sets (10k+ pages), link cycles are common. To ensure Zenzic scales without hitting recursion limits or falling into infinite loops, I implemented an &lt;strong&gt;iterative DFS (Depth-First Search)&lt;/strong&gt; with a three-color marking system.&lt;/p&gt;

&lt;p&gt;Pre-computing the cycle registry in Phase 1.5 allows Phase 2 (Validation) to remain &lt;strong&gt;O(1)&lt;/strong&gt; per-query. This ensures that even massive docsets are validated in seconds.&lt;/p&gt;
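&lt;p&gt;The three-color scheme (white = unvisited, gray = on the current DFS path, black = fully explored) is what lets an explicit-stack DFS detect back edges without recursion. A compact sketch over a toy link graph; the graph shape and names are illustrative:&lt;/p&gt;

```python
WHITE, GRAY, BLACK = 0, 1, 2

def find_cycle_nodes(graph):
    """Iterative three-color DFS: return endpoints of back edges.

    Every returned node sits on some cycle; no recursion, so arbitrarily
    deep link chains cannot hit the interpreter's recursion limit.
    """
    color = {v: WHITE for v in graph}
    on_cycle = set()
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, children = stack[-1]
            child = next(children, None)
            if child is None:
                color[node] = BLACK              # fully explored
                stack.pop()
            elif color.get(child) == WHITE:
                color[child] = GRAY              # entering the current path
                stack.append((child, iter(graph[child])))
            elif color.get(child) == GRAY:
                on_cycle.update((node, child))   # back edge found
            # dangling targets (not in graph) are simply ignored here

# a.md -> b.md -> c.md -> a.md is a cycle; d.md only points into it
    return on_cycle

links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(sorted(find_cycle_nodes(links)))   # ['a', 'c']
```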

&lt;h3&gt;
  
  
  🇮🇹 4. Dogfooding i18n
&lt;/h3&gt;

&lt;p&gt;We believe in bilingual documentation. Zenzic supports native i18n with "Ghost Routes"—logical paths that don't exist on disk but are resolved by build plugins. We dogfood this by keeping our own documentation in full parity between &lt;strong&gt;English and Italian&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Performance and Portability
&lt;/h2&gt;

&lt;p&gt;By enforcing a &lt;strong&gt;"No Subprocesses"&lt;/strong&gt; rule, Zenzic is 100% Pure Python. It’s safe to run in restricted or non-privileged container environments, making it a perfect fit for modern GitOps workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sguwbonky4dp8ta674x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sguwbonky4dp8ta674x.png" alt="Zenzic Sentinel Report" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🏁 Join the "Red Team"
&lt;/h2&gt;

&lt;p&gt;Zenzic is open-source and currently in &lt;strong&gt;Alpha 4&lt;/strong&gt;. We are looking for technical feedback on our VSM logic and security patterns. Can you bypass our Shield? Can you break our link resolver?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/PythonWoods/zenzic/tree/main" rel="noopener noreferrer"&gt;https://github.com/PythonWoods/zenzic/tree/main&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://zenzic.pythonwoods.dev" rel="noopener noreferrer"&gt;https://zenzic.pythonwoods.dev&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install --pre zenzic&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;"The Code is Law. The Documentation is Truth. The Sentinel is vigilant."&lt;/strong&gt; 🛡️⚓&lt;/p&gt;





&lt;p&gt;Thanks for reading.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>security</category>
      <category>markdown</category>
    </item>
  </channel>
</rss>
