## The problem
I've been using local LLMs to generate structured Markdown knowledge files — architecture docs, runbooks, API references. After a few hundred files, the knowledge base becomes noise.
Wrong field types. Invalid enum values. Dates in the wrong format. Domains that don't exist in the taxonomy. Dataview queries return nothing. The graph becomes useless.
The issue isn't the model. It's that there's no contract between "LLM output" and "file that reaches disk."
## The solution: a validation gate
AKF sits between the LLM and the filesystem:
```
Prompt → LLM → Validation Engine → Error Normalizer → Retry Controller → Commit Gate → File
```
- LLM generates a Markdown file with YAML frontmatter
- Validation Engine checks it — binary VALID/INVALID, typed error codes (E001–E007)
- If invalid, Error Normalizer translates errors into correction instructions and sends them back to the LLM
- Retry Controller retries up to 3 times — aborts if the same error fires twice (prevents infinite cost loops)
- Commit Gate writes atomically: only VALID output reaches disk (the loop is sketched below)
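Here's a minimal sketch of that loop, assuming hypothetical `generate`, `validate`, and `normalize` callables; AKF's actual internals may differ:

```python
import os
import tempfile

MAX_RETRIES = 3

def run_pipeline(prompt, generate, validate, normalize, dest_path):
    """Generate -> validate -> retry, with an atomic commit gate at the end."""
    seen_codes = set()
    feedback = ""
    for _ in range(MAX_RETRIES):
        draft = generate(prompt + feedback)
        errors = validate(draft)  # assumed to return (code, field, message) tuples
        if not errors:
            # Commit gate: write to a temp file, then atomically swap into place,
            # so only VALID output ever appears at dest_path.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".")
            with os.fdopen(fd, "w") as f:
                f.write(draft)
            os.replace(tmp, dest_path)
            return dest_path
        codes = {code for code, _, _ in errors}
        if codes & seen_codes:  # same error twice: abort, don't burn tokens
            raise RuntimeError(f"repeated error(s): {codes & seen_codes}")
        seen_codes |= codes
        feedback = "\n\nFix these issues:\n" + "\n".join(normalize(e) for e in errors)
    raise RuntimeError("still invalid after max retries")
```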
## Your taxonomy, not mine
The schema is external. You define it in `akf.yaml`:

```yaml
schema_version: "1.0.0"
vault_path: "./vault"
taxonomy:
  domain:
    - api-design
    - backend-engineering
    - devops
    - security
  type: [concept, guide, reference, checklist]
  level: [beginner, intermediate, advanced]
  status: [draft, active, completed, archived]
```
Change the taxonomy, rebuild nothing. The validation engine loads it at runtime.
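That runtime load is cheap. Here's a sketch of what it can look like, using PyYAML; the function names are mine, not AKF's:

```python
import yaml  # PyYAML

def load_taxonomy(path="akf.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)["taxonomy"]

def check_enums(frontmatter, taxonomy):
    """Return E001/E006-style errors for any out-of-taxonomy value."""
    errors = []
    for field, allowed in taxonomy.items():
        value = frontmatter.get(field)
        if value is not None and value not in allowed:
            code = "E006" if field == "domain" else "E001"
            errors.append((code, field, f"{value!r} not in {allowed}"))
    return errors
```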
## Usage
```bash
pip install ai-knowledge-filler
akf init
akf generate "Create a guide on Docker networking" --provider ollama --model llama3.2
```
Or via the Python API:
```python
from akf import Pipeline

pipeline = Pipeline(output="./vault/", provider="ollama", model="mistral")
result = pipeline.generate("Create a reference for JWT authentication")
print(result.file_path)  # only printed if validation passed
```
Works with Ollama, Claude, Gemini, and GPT-4. An MCP server just shipped: Claude Projects can now call the pipeline directly, with no human relay in the loop.
## One unexpected insight
When a domain triggers elevated retries, it's usually not the model failing. It's a signal that the taxonomy boundary is ambiguous — the LLM keeps proposing something that doesn't fit any enum because the enum doesn't match how the concept naturally compresses.
Retry rate becomes a health metric for your schema, not a failure metric for the model.
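If you log one record per generated file, surfacing the noisy domains takes a few lines. A sketch, assuming each record carries `domain` and `retries` keys (the log format here is hypothetical):

```python
from collections import defaultdict

def noisy_domains(records, threshold=1.0):
    """Domains whose mean retry count suggests an ambiguous taxonomy boundary."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [total retries, file count]
    for rec in records:
        totals[rec["domain"]][0] += rec["retries"]
        totals[rec["domain"]][1] += 1
    return {d: r / n for d, (r, n) in totals.items() if r / n >= threshold}
```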
## Error codes
| Code | Field | Meaning |
|---|---|---|
| E001 | type/level/status | Invalid enum value |
| E002 | any | Required field missing |
| E003 | created/updated | Date not ISO 8601 |
| E004 | title/tags | Type mismatch |
| E005 | frontmatter | General schema violation |
| E006 | domain | Not in taxonomy |
| E007 | created/updated | `created` is later than `updated` |
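These codes are what the Error Normalizer turns into correction instructions. A minimal sketch of that mapping; the templates are illustrative, not AKF's actual wording:

```python
TEMPLATES = {
    "E001": "Field '{field}' has an invalid enum value; use one of the allowed values.",
    "E002": "Required field '{field}' is missing; add it to the frontmatter.",
    "E003": "Field '{field}' must be an ISO 8601 date (YYYY-MM-DD).",
    "E004": "Field '{field}' has the wrong type; check the schema.",
    "E005": "The frontmatter violates the schema; regenerate the YAML block.",
    "E006": "Field 'domain' is not in the taxonomy; choose an existing domain.",
    "E007": "'created' must not be later than 'updated'; fix the dates.",
}

def normalize(error):
    code, field, _ = error
    return f"[{code}] " + TEMPLATES[code].format(field=field)
```

So `normalize(("E003", "created", "bad date"))` yields `[E003] Field 'created' must be an ISO 8601 date (YYYY-MM-DD).`, which goes straight back into the retry prompt.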
## Links
- GitHub: ai-knowledge-filler
- PyPI: `pip install ai-knowledge-filler`
- Example configs: software engineering, legal ops, technical writing
Are you dealing with schema drift in AI-generated content? What's your current approach — post-hoc review, Pydantic models, something else?