Building a Markdown-to-JSON Pipeline with Structured LLM Output


You have hundreds of markdown documents — README files, changelogs, internal wikis — and you need to extract structured data from them: version numbers, author names, feature lists, breaking changes. Manually parsing this is brittle; regex breaks the moment someone adjusts the formatting. Language models can read the content, but without enforced structure, the output is unpredictable in production.

This article walks through a Python pipeline that takes arbitrary markdown, sends it to an LLM, and reliably returns validated JSON. We'll cover schema design, prompt engineering for structured output, response validation with Pydantic, and error handling that doesn't silently pass bad data downstream.

Why Not Just Use a Markdown Parser?

Standard parsers like markdown-it or mistletoe work well for HTML conversion, but they can't extract semantic content. Given this changelog entry:

## v2.3.1 — 2026-04-15

**Breaking changes:**
- Removed `legacy_auth` parameter from `/api/login`
- `POST /api/users` now requires `email` field

**Fixes:**
- Fixed race condition in session handler
- Corrected 500 error on empty request body

A parser tells you there's an h2, a bold element, and list items. It has no idea which item is a version number, which is a breaking change, and which is a fix. That's a semantic extraction problem — and language models handle it well.

The alternative — writing custom regex or heuristics — works until the document format drifts. One extra blank line or a slightly different heading convention is enough to break it. An LLM-based approach is more resilient to formatting variation, as long as you enforce the output schema on your end.
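To make the drift problem concrete, here is a minimal sketch of a heading regex written against the sample entry above. The pattern, the `parse_heading` helper, and the second heading variant are illustrative assumptions, not taken from any real project.

```python
import re

# Hypothetical pattern tuned to the sample heading; \u2014 is the dash
# character used between the version and the date in that heading.
HEADING_RE = re.compile(r"^## v(\d+\.\d+\.\d+) \u2014 (\d{4}-\d{2}-\d{2})$")


def parse_heading(line: str):
    """Return (version, date) if the line matches the expected layout, else None."""
    m = HEADING_RE.match(line)
    return m.groups() if m else None


original = "## v2.3.1 \u2014 2026-04-15"
drifted = "## Version 2.3.1 (2026-04-15)"  # assumed variant: same facts, new layout

print(parse_heading(original))  # ('2.3.1', '2026-04-15')
print(parse_heading(drifted))   # None
```

Both headings carry the same two facts, but only the first layout survives the regex. You could loosen the pattern, but each loosening is another guess about how the next document will be formatted.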

Designing the Output Schema

Start with the exact shape you need. Pydantic makes the schema explicit and gives you validation for free.

from pydantic import BaseModel, Field
from typing import Optional
from datetime import date

class ChangelogEntry(BaseModel):
    version: str = Field(..., pattern=r"^\d+\.\d+\.\d+$")
    release_date: Optional[date] = None
    breaking_changes: list[str] = Field(default_factory=list)
    fixes: list[str] = Field(default_factory=list)
    features: list[str] = Field(default_factory=list)

class ChangelogDocument(BaseModel):
    entries: list[ChangelogEntry]

Keep the schema minimal. Every field you add is a surface for hallucination — the model filling in values not present in the source. Only define fields the actual documents contain.
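For the sample entry shown earlier, the JSON this schema is meant to capture would look roughly like the sketch below. The values are copied from that entry. Dropping the leading "v" from the version is an assumption that follows from the `pattern` constraint, and `VERSION_RE` is a stdlib stand-in for that constraint rather than part of the pipeline.

```python
import json
import re

# Target output for the sample v2.3.1 entry. Nothing here is invented:
# fields with no counterpart in the source stay empty.
expected = {
    "entries": [
        {
            "version": "2.3.1",  # leading "v" dropped to satisfy the pattern
            "release_date": "2026-04-15",
            "breaking_changes": [
                "Removed `legacy_auth` parameter from `/api/login`",
                "`POST /api/users` now requires `email` field",
            ],
            "fixes": [
                "Fixed race condition in session handler",
                "Corrected 500 error on empty request body",
            ],
            "features": [],  # the entry lists none
        }
    ]
}

# Same regular expression as the `version` field in ChangelogEntry.
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+$")

print(json.dumps(expected, indent=2))
print(bool(VERSION_RE.match(expected["entries"][0]["version"])))  # True
print(bool(VERSION_RE.match("v2.3.1")))                           # False
```

Writing out one expected result by hand like this is a cheap way to notice fields the documents never actually populate.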

ChangelogDocument.model_json_schema() returns a JSON Schema dict you can embed directly in the prompt. This is more reliable than describing the schema in plain English, because the model sees the exact field names, types, and constraints.
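As a rough illustration of what the model actually sees, the sketch below prints the schema fragment for the `version` field and checks that the same constraint is enforced at validation time. It assumes Pydantic v2 and uses a trimmed copy of the models above; `is_valid_version` is a hypothetical helper added for the demonstration.

```python
import json
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ChangelogEntry(BaseModel):
    version: str = Field(..., pattern=r"^\d+\.\d+\.\d+$")
    release_date: Optional[date] = None
    breaking_changes: list[str] = Field(default_factory=list)


class ChangelogDocument(BaseModel):
    entries: list[ChangelogEntry]


schema = ChangelogDocument.model_json_schema()

# Nested models are emitted under "$defs" and referenced from the parent.
entry_schema = schema["$defs"]["ChangelogEntry"]
print(json.dumps(entry_schema["properties"]["version"], indent=2))


def is_valid_version(value: str) -> bool:
    """Hypothetical helper: does the value pass the `version` constraint?"""
    try:
        ChangelogEntry(version=value)
        return True
    except ValidationError:
        return False


print(is_valid_version("2.3.1"))   # True
print(is_valid_version("v2.3.1"))  # False
```

Because the pattern appears in the embedded schema, the model has a chance to normalize "v2.3.1" to "2.3.1" before validation rejects it.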

Prompting for Reliable JSON Output

Instruct the model to return only JSON, include the schema explicitly, and forbid markdown code fences.

import json

SYSTEM_PROMPT = """You are a document parser. Extract structured data from markdown.
Output ONLY valid JSON matching the provided schema.
Do NOT wrap output in markdown code blocks.
If a field is absent from the document, use null or an empty list."""

def build_prompt(schema: dict, markdown_content: str) -> str:
    return f"""Extract data from this markdown document.

Schema:
{json.dumps(schema, indent=2)}

Document:
{markdown_content}

Output only the JSON object. No explanations, no code fences."""

Two things matter here. First, embed the schema as JSON rather than a prose description, so the model sees the exact field names and types instead of a paraphrase it can misread. Second, explicitly forbid code fences. Models frequently wrap responses in triple backticks even when instructed otherwise, so strip them defensively in your parsing layer regardless.

The Extraction and Validation Layer

This is where most pipelines fail: they call the model, assume valid JSON, and crash or silently corrupt data downstream.

import re
from typing import Type, TypeVar
from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)

def strip_fences(raw: str) -> str:
    # Capture the body of a fenced block, with or without a "json" language tag.
    match = re.search(r"```(?:json)?\s*([\s\S]*?)```", raw)
    return match.group(1).strip() if match else raw.strip()

def parse_llm_output(raw_response: str, model_class: Type[T]) -> T:
    cleaned = strip_fences(raw_response)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model returned invalid JSON: {e}\nRaw: {cleaned[:300]}") from e
    try:
        # model_validate reports non-object JSON (e.g. a bare list) as a
        # ValidationError; model_class(**data) would raise TypeError instead.
        return model_class.model_validate(data)
    except ValidationError as e:
        raise ValueError(f"Schema validation failed: {e}") from e

def run_pipeline(llm_client, markdown_content: str) -> ChangelogDocument:
    schema = ChangelogDocument.model_json_schema()
    response = llm_client.complete(
        system=SYSTEM_PROMPT,
        user=build_prompt(schema, markdown_content),
        temperature=0.0,
        max_tokens=2048,
    )
    return parse_llm_output(response.text, ChangelogDocument)

Set temperature=0.0 for extraction tasks. You want output that is as repeatable as possible. Many hosted APIs do not guarantee identical results even at zero, but any sampling variation adds noise to parsing behavior and makes debugging harder. Always log the raw response on failure; without it you can't tell whether the model regressed or your prompt did.
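One way to log the raw response on failure is sketched below, using only the standard library. The logger name and the `loads_logged` helper are assumptions for illustration; the fence pattern mirrors `strip_fences` above.

```python
import json
import logging
import re

logger = logging.getLogger("md2json")  # hypothetical logger name

# Built from single backticks so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*([\s\S]*?)" + FENCE)


def strip_fences(raw: str) -> str:
    """Return the body of the first fenced block, or the stripped input."""
    match = FENCE_RE.search(raw)
    return match.group(1).strip() if match else raw.strip()


def loads_logged(raw_response: str):
    """Parse model output as JSON; log the untouched response if that fails."""
    cleaned = strip_fences(raw_response)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        # Log the raw text rather than the cleaned text, so bugs in the
        # fence handling are visible too.
        logger.error("invalid JSON from model: %s | raw=%r", exc, raw_response)
        raise ValueError(f"Model returned invalid JSON: {exc}") from exc


fenced = FENCE + 'json\n{"entries": []}\n' + FENCE
bare = '  {"entries": []}  '

print(loads_logged(fenced))  # {'entries': []}
print(loads_logged(bare))    # {'entries': []}
```

In a real deployment you would probably attach a document identifier to the log record as well, so a failing response can be traced back to its source file.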

Handling Failures with Self-Correction

Structured pipelines fail in two predictable ways: malformed JSON, or valid JSON that doesn't match the schema. Both are recoverable — but feed the error back to the model rather than retrying with the same input.

def run_with_retry(
    llm_client, markdown_content: str, max_retries: int = 2
) -> ChangelogDocument:
    context = markdown_content
    last_error = None

    for attempt in range(max_retries + 1):
        try:
            return run_pipeline(llm_client, context)
        except ValueError as e:
            last_error = e
            if attempt < max_retries:
                context = (
                    f"Your previous output was invalid: {str(e)}\n"
                    f"Fix the error and output only valid JSON.\n\n"
                    f"Original document:\n{markdown_content}"
                )

    raise RuntimeError(
        f"Pipeline failed after {max_retries + 1} attempts: {last_error}"
    )

Feeding the specific validation error into the next prompt gives the model something concrete to act on: it can address a missing required field or a type mismatch when told exactly what failed. Cap retries at 2; if it's still broken after three attempts, the document is likely outside the schema's scope. Log it and move on rather than burning tokens in a loop.
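The retry loop can be exercised without a real model. The sketch below uses a hypothetical `FlakyClient` stub whose first reply is malformed, and a simplified `extract_with_feedback` loop that only checks JSON syntax. The single-argument `complete` signature is an assumption made to keep the example self-contained; it is not the client interface used above.

```python
import json


class FlakyClient:
    """Hypothetical stand-in for an LLM client: the first reply is broken
    JSON, later replies are valid. Every prompt it receives is recorded."""

    def __init__(self):
        self.prompts = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if len(self.prompts) == 1:
            return '{"entries": [}'  # malformed on purpose
        return '{"entries": []}'


def extract_with_feedback(client, document: str, max_retries: int = 2) -> dict:
    """Retry loop that puts the previous error into the next prompt."""
    prompt = document
    last_error = None
    for _ in range(max_retries + 1):
        raw = client.complete(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            prompt = (
                f"Your previous output was invalid: {exc}\n"
                "Fix the error and output only valid JSON.\n\n"
                f"Original document:\n{document}"
            )
    raise RuntimeError(f"failed after {max_retries + 1} attempts: {last_error}")


client = FlakyClient()
result = extract_with_feedback(client, "## v2.3.1")
print(result)               # {'entries': []}
print(len(client.prompts))  # 2
```

Recording the prompts lets a unit test assert that the second attempt really did contain the error text, which is the behavior this section depends on.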

For observability patterns around LLM pipeline failures — logging schemas, alerting thresholds, audit trails — the security and operational checklists at AYI NEDJIMI Consultants cover these requirements in a format you can adapt for production workloads.

The Takeaway

Reliable structured output from a language model requires enforcement at multiple layers: schema-in-prompt, JSON-only output instruction, fence stripping, json.loads validation, and Pydantic schema validation. Skip any one of these and you'll eventually get a silent failure or corrupted downstream data.

This pattern generalizes well beyond changelogs. It applies to any semi-structured markdown: API documentation, meeting notes turned into action items, security reports normalized into findings. The discipline stays the same — keep the schema minimal, validate everything that comes back, and log failures with full context so the prompt can be improved over time.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.