Linghua Jin
Automate Python Manual Extraction: Build End-to-End PDF -> LLM -> SQL Flows with CocoIndex, Ollama, and Postgres

Overview

We'll build an end-to-end data extraction pipeline engineered for full automation, reproducibility, and clarity. The objective: transform unstructured PDF documentation (such as Python's official manuals) into precise, structured, queryable tables using CocoIndex, Ollama, and a locally hosted LLM.


Flow Architecture

  1. Document Parsing: For each PDF file, the pipeline uses a custom parser to convert binary content to markdown, leveraging CPU or GPU acceleration for high throughput.
  2. Structured Data Extraction: With built-in ExtractByLlm functions from CocoIndex, markdown is processed by an LLM running locally (like Meta's Llama 3 via Ollama), yielding structured Python dataclass outputs (ModuleInfo).
  3. Post-Processing & Summarization: The pipeline applies a custom summarization operator, counting and describing structure (number of classes, methods, etc.).
  4. Data Collection & Export: All structured outputs are indexed in PostgreSQL, powering downstream analytics and fast queries.

This extensible approach supports a variety of schemas, formats, and even alternative LLMs through CocoIndex's modular operator system.


Prerequisites and Environment Setup

  • Install Postgres
  • Download and install Ollama. Pull your preferred LLM model:
ollama pull llama3.2

CocoIndex supports Ollama, Gemini, and LiteLLM for on-prem and hybrid cloud use cases.
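Before wiring up the flow, it's worth confirming the model is actually pulled. Ollama serves a local HTTP API (port 11434 by default) whose `GET /api/tags` endpoint lists installed models; the sketch below parses a hypothetical response payload rather than calling a live server, so the field values are illustrative:

```python
import json

# Hypothetical payload in the shape returned by Ollama's local
# `GET http://localhost:11434/api/tags` endpoint, which lists pulled models.
sample_response = json.dumps({
    "models": [
        {"name": "llama3.2:latest", "size": 2019393189},
    ]
})

def model_available(payload: str, model: str) -> bool:
    """Return True if `model` appears among the pulled models."""
    models = json.loads(payload).get("models", [])
    return any(m["name"].split(":")[0] == model for m in models)

print(model_available(sample_response, "llama3.2"))  # True
```

In practice you would fetch the payload from the running daemon; parsing it first keeps the check testable offline.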


Step 1: Add a Source (Python PDFs as Input)

Register the "manuals" directory as a binary source:

import cocoindex

@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="manuals", binary=True)
    )
    modules_index = data_scope.add_collector()

Each source file is automatically indexed as:

  • filename: str (for traceability)
  • content: bytes (for lossless parsing)
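Conceptually, each source row is a (filename, content) pair. A plain-Python analogue of that enumeration (using a throwaway directory rather than CocoIndex itself, which additionally tracks changes for incremental updates) looks like:

```python
import tempfile
from pathlib import Path

# Stand-in for the "manuals" directory used by the flow above.
with tempfile.TemporaryDirectory() as d:
    Path(d, "asyncio.pdf").write_bytes(b"%PDF-1.7 dummy")

    rows = [
        {"filename": p.name, "content": p.read_bytes()}  # str + bytes, as indexed
        for p in sorted(Path(d).glob("*.pdf"))
    ]

print(rows[0]["filename"])  # asyncio.pdf
```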

Step 2: Convert PDF to Markdown with a Custom Operator

Define a function/executor that handles PDF-to-markdown transformation (pluggable for custom or high-performance parsing):

import tempfile

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        config_parser = ConfigParser({})
        self._converter = PdfConverter(
            create_model_dict(), config=config_parser.generate_config_dict()
        )

    def __call__(self, content: bytes) -> str:
        # Marker's converter operates on files, so spill the bytes to a temp PDF.
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text

Advantages: one-time resource initialization in prepare(), optional GPU acceleration, a pluggable parsing backend, and robust result caching.


Step 3: Pass Through LLM for Structured Data Extraction

Define your schema with dataclasses, then extract info using ExtractByLlm:

import dataclasses

@dataclasses.dataclass
class MethodInfo:
    name: str
    args: list[str]
    description: str

@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: list[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: list[ClassInfo]
    methods: list[MethodInfo]

with data_scope["documents"].row() as doc:
    doc["markdown"] = doc["content"].transform(PdfToMarkdown())
    doc["module_info"] = doc["markdown"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(api_type=cocoindex.LlmApiType.OLLAMA, model="llama3.2"),
            output_type=ModuleInfo,
            instruction="Please extract Python module information from the manual.",
        )
    )
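To make the target schema concrete, here is a hand-built instance using plain dataclasses that mirror the ones above (the field values are invented for illustration, roughly what the LLM might return for a page of the json module's manual):

```python
import dataclasses

@dataclasses.dataclass
class MethodInfo:
    name: str
    args: list[str]
    description: str

@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: list[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: list[ClassInfo]
    methods: list[MethodInfo]

module = ModuleInfo(
    title="json",
    description="Encode and decode JSON documents.",
    classes=[
        ClassInfo(
            name="JSONDecoder",
            description="Simple JSON decoder.",
            methods=[MethodInfo("decode", ["s"], "Return the Python representation of s.")],
        )
    ],
    methods=[MethodInfo("loads", ["s"], "Deserialize s to a Python object.")],
)
```

Because the output type is a typed contract, malformed LLM output surfaces as a validation failure instead of silently corrupting the index.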

Step 4: Summarize and Export Data

Add post-processing, collect, and export to SQL:

@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int

@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

with data_scope["documents"].row() as doc:
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
    modules_index.collect(
        filename=doc["filename"],
        module_info=doc["module_info"],
        module_summary=doc["module_summary"],
    )

modules_index.export(
    "modules",
    cocoindex.storages.Postgres(table_name="modules_info"),
    primary_key_fields=["filename"],
)
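The summarization operator is ordinary Python, so its logic can be checked in isolation with a stand-in dataclass (no CocoIndex required; `FakeModuleInfo` carries only the fields the function touches):

```python
import dataclasses

@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int

# Minimal stand-in for the extracted ModuleInfo.
@dataclasses.dataclass
class FakeModuleInfo:
    classes: list
    methods: list

def summarize_module(module_info) -> ModuleSummary:
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

summary = summarize_module(FakeModuleInfo(classes=["A", "B"], methods=["f"]))
print(summary)  # ModuleSummary(num_classes=2, num_methods=1)
```

Keeping operators free of framework state like this is what makes them unit-testable before they ever touch the pipeline.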

Scale to any number of docs with a single update command:

cocoindex update -L main

Why CocoIndex Is Handy for LLM Workflows

  • Typed contracts for prompt engineering and schema evolution
  • Modular operator graph (e.g., plug in new LLMs or tools seamlessly)
  • Pluggable LLM endpoints (on-prem/cloud)
  • Indexed, versioned data: easy debugging, live regression testing
  • Native support for document retrieval and hybrid agent workflows
  • Full MLOps support: audit logs, CI/CD, rollback, and analytics

Complete Example: Query and Dashboard

Once indexed, all module summaries are available via SQL:

SELECT filename, module_info->>'title' AS title, module_summary FROM modules_info;

For visual dashboards and data lineage, try CocoInsight.


#python #llm #ollama #cocoindex #dataengineering #automation #mlops #devops #pdf #opensource
