Petr Nazarenko
I built a validation pipeline that blocks AI-generated files from reaching disk if they fail schema checks

The problem

I've been using local LLMs to generate structured Markdown knowledge files — architecture docs, runbooks, API references. After a few hundred files, the knowledge base becomes noise.

Wrong field types. Invalid enum values. Dates in the wrong format. Domains that don't exist in the taxonomy. Dataview queries return nothing. The graph becomes useless.

The issue isn't the model. It's that there's no contract between "LLM output" and "file that reaches disk."

The solution: a validation gate

AKF (ai-knowledge-filler) sits between the LLM and the filesystem:

```
Prompt → LLM → Validation Engine → Error Normalizer → Retry Controller → Commit Gate → File
```
  1. LLM generates a Markdown file with YAML frontmatter
  2. Validation Engine checks it — binary VALID/INVALID, typed error codes (E001–E007)
  3. If invalid, Error Normalizer translates errors into correction instructions and sends them back to the LLM
  4. Retry Controller retries up to 3 times — aborts if the same error fires twice (prevents infinite cost loops)
  5. Commit Gate writes atomically — only VALID output reaches disk
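
The loop above can be sketched in a few lines. This is an illustrative reconstruction, not AKF's actual internals: `run_pipeline` and its callback signatures are hypothetical names for the stages described in the steps.

```python
# Sketch of the validate -> normalize -> retry -> commit loop (illustrative names,
# not the library's real API).
def run_pipeline(prompt, llm, validate, commit, max_retries=3):
    seen_errors = set()
    for attempt in range(max_retries):
        draft = llm(prompt)
        errors = validate(draft)           # typed error codes, e.g. ["E003"]
        if not errors:
            return commit(draft)           # atomic write: only VALID output reaches disk
        key = tuple(sorted(errors))
        if key in seen_errors:             # same error set twice -> abort (no infinite cost loop)
            raise RuntimeError(f"aborted: repeated errors {errors}")
        seen_errors.add(key)
        # Error Normalizer: fold correction instructions back into the prompt
        prompt = f"{prompt}\n\nFix these validation errors: {errors}"
    raise RuntimeError("max retries exceeded")
```

The abort-on-repeat check is the important detail: if the model returns the identical error set twice, more retries are unlikely to help, so the loop stops paying for them.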

Your taxonomy, not mine

The schema is external. You define it in akf.yaml:

```yaml
schema_version: "1.0.0"
vault_path: "./vault"

taxonomy:
  domain:
    - api-design
    - backend-engineering
    - devops
    - security
  type: [concept, guide, reference, checklist]
  level: [beginner, intermediate, advanced]
  status: [draft, active, completed, archived]
```

Change the taxonomy, rebuild nothing. The validation engine loads it at runtime.
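
To make the runtime-loaded contract concrete, here is a minimal sketch of enum checking against a taxonomy dict shaped like the `akf.yaml` above. The function name is illustrative; the error codes mirror the table later in the post (E006 for an unknown domain, E001 for other enum fields).

```python
# Hypothetical enum validator; the dict mirrors the taxonomy section of akf.yaml.
TAXONOMY = {
    "domain": ["api-design", "backend-engineering", "devops", "security"],
    "type": ["concept", "guide", "reference", "checklist"],
    "level": ["beginner", "intermediate", "advanced"],
    "status": ["draft", "active", "completed", "archived"],
}

def check_enums(frontmatter, taxonomy=TAXONOMY):
    """Return typed errors for enum fields outside the taxonomy."""
    errors = []
    for field, allowed in taxonomy.items():
        value = frontmatter.get(field)
        if value is not None and value not in allowed:
            errors.append(("E006" if field == "domain" else "E001", field, value))
    return errors
```

Because the taxonomy is plain data, swapping it out changes the contract without touching code.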

Usage

```shell
pip install ai-knowledge-filler
akf init
akf generate "Create a guide on Docker networking" --provider ollama --model llama3.2
```

Or via Python API:

```python
from akf import Pipeline

pipeline = Pipeline(output="./vault/", provider="ollama", model="mistral")
result = pipeline.generate("Create a reference for JWT authentication")
print(result.file_path)  # only printed if validation passed
```

Works with Ollama, Claude, Gemini, GPT-4. MCP server just shipped — Claude Projects can now call the pipeline directly without a human relay.

One unexpected insight

When a domain triggers elevated retries, it's usually not the model failing. It's a signal that the taxonomy boundary is ambiguous — the LLM keeps proposing something that doesn't fit any enum because the enum doesn't match how the concept naturally compresses.

Retry rate becomes a health metric for your schema, not a failure metric for the model.
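
A small sketch of what "retry rate as a health metric" could look like in practice. The record format is hypothetical: assume one `(domain, retries)` pair per generation.

```python
# Hypothetical schema-health metric: average retries per generation, by domain.
from collections import defaultdict

def retry_rate_by_domain(records):
    """records: iterable of (domain, retry_count) pairs, one per generation."""
    totals = defaultdict(lambda: [0, 0])   # domain -> [retry_sum, generation_count]
    for domain, retries in records:
        totals[domain][0] += retries
        totals[domain][1] += 1
    return {d: s / n for d, (s, n) in totals.items()}

def flag_ambiguous_domains(records, threshold=1.0):
    """Domains averaging more than `threshold` retries: candidates for taxonomy review."""
    rates = retry_rate_by_domain(records)
    return sorted(d for d, r in rates.items() if r > threshold)
```

A domain that consistently averages more than one retry per generation is worth a look at its enum boundaries before blaming the model.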

Error codes

| Code | Field | Meaning |
|------|-------|---------|
| E001 | type/level/status | Invalid enum value |
| E002 | any | Required field missing |
| E003 | created/updated | Date not ISO 8601 |
| E004 | title/tags | Type mismatch |
| E005 | frontmatter | General schema violation |
| E006 | domain | Not in taxonomy |
| E007 | created/updated | created > updated |
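
The date checks (E003 and E007) are simple to sketch with the standard library. This is an assumed implementation, not AKF's; the function name is illustrative.

```python
# Hypothetical date validator for the E003/E007 checks above.
from datetime import date

def check_dates(frontmatter):
    errors = []
    parsed = {}
    for field in ("created", "updated"):
        raw = frontmatter.get(field)
        if raw is None:
            continue
        try:
            parsed[field] = date.fromisoformat(raw)   # E003: not ISO 8601
        except ValueError:
            errors.append(("E003", field))
    if "created" in parsed and "updated" in parsed and parsed["created"] > parsed["updated"]:
        errors.append(("E007", "created/updated"))    # E007: created > updated
    return errors
```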

Are you dealing with schema drift in AI-generated content? What's your current approach — post-hoc review, Pydantic models, something else?
