Linghua Jin

Posted on Dec 4

Automate Python Manual Extraction: Build End-to-End PDF -> LLM -> SQL Flows with CocoIndex, Ollama, and Postgres

#python #llm #dataengineering #etl

Overview

We'll build an end-to-end data extraction pipeline that is automated and reproducible. Our objective: transform unstructured PDF documentation (like Python's official manuals) into structured, queryable tables using CocoIndex, a local LLM served by Ollama, and Postgres.

Flow Architecture

Document Parsing: For each PDF file, the pipeline uses a custom parser to convert binary content to markdown, leveraging CPU or GPU acceleration for high throughput.
Structured Data Extraction: With built-in ExtractByLlm functions from CocoIndex, markdown is processed by an LLM running locally (like Meta's Llama 3 via Ollama), yielding structured Python dataclass outputs (ModuleInfo).
Post-Processing & Summarization: The pipeline applies a custom summarization operator, counting and describing structure (number of classes, methods, etc.).
Data Collection & Export: All structured outputs are indexed in PostgreSQL, powering downstream analytics and fast queries.

This extensible approach supports a variety of schemas, formats, and even alternative LLMs through CocoIndex's modular operator system.
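To make the data flow concrete, here is a minimal plain-Python sketch of the same four stages. It does not use CocoIndex at all: `Row`, `run_pipeline`, and the lambda stages are hypothetical stand-ins (a real flow would call a PDF parser and an LLM), so treat it as an illustration of how each stage reads one field and writes the next.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One source document (hypothetical container, not a CocoIndex type)."""
    filename: str
    content: bytes


def run_pipeline(rows, stages):
    """Apply each (output_field, input_field, fn) stage to every row, in order."""
    collected = []
    for row in rows:
        values = {"content": row.content}
        for out_name, in_name, fn in stages:
            values[out_name] = fn(values[in_name])
        del values["content"]  # keep only derived fields in the output
        collected.append({"filename": row.filename, **values})
    return collected


# Stand-in stages: parse -> extract -> summarize.
stages = [
    ("markdown", "content", lambda b: b.decode("utf-8")),
    ("module_info", "markdown", lambda md: {"title": md.splitlines()[0].lstrip("# ")}),
    ("module_summary", "module_info", lambda info: {"title_length": len(info["title"])}),
]

result = run_pipeline([Row("demo.pdf", b"# json\nEncode and decode JSON.")], stages)
print(result[0]["module_info"])
```

Each stage only names its input and output field, which is roughly why swapping the parser or the LLM leaves the rest of the flow untouched.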

Prerequisites and Environment Setup

Install Postgres
Download and install Ollama. Pull your preferred LLM model:

ollama pull llama3.2

CocoIndex supports Ollama, Gemini, and LiteLLM for on-prem and hybrid cloud use cases.

Step 1: Add a Source (Python PDFs as Input)

import cocoindex

@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Read every file under ./manuals as raw bytes.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="manuals", binary=True)
    )
    # Collector that gathers one row per document for export.
    modules_index = data_scope.add_collector()

Each source file is automatically indexed as:

filename: str (for traceability)
content: bytes (for lossless parsing)
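If it helps to picture what those two fields hold, the sketch below produces equivalent records with the standard library only. `read_sources` is a hypothetical helper, not part of CocoIndex, and the `*.pdf` filter is an assumption made for the demo.

```python
import tempfile
from pathlib import Path


def read_sources(root):
    """Yield one {filename, content} record per PDF under root, bytes untouched."""
    root = Path(root)
    for path in sorted(root.rglob("*.pdf")):
        yield {
            "filename": path.relative_to(root).as_posix(),
            "content": path.read_bytes(),
        }


# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "array.pdf").write_bytes(b"%PDF-1.7 demo")
    (Path(tmp) / "notes.txt").write_text("ignored")
    records = list(read_sources(tmp))

print([r["filename"] for r in records])
```

Keeping `content` as bytes means the parser downstream sees exactly what is on disk, with no decoding step that could corrupt a binary PDF.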

Step 2: Convert PDF to Markdown with a Custom Operator

Define a function/executor that handles PDF-to-markdown transformation (pluggable for custom or high-performance parsing):

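import tempfile

# PdfConverter and its helpers come from the marker-pdf package.
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        # Runs once per executor: load the converter models up front.
        config_parser = ConfigParser({})
        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())

    def __call__(self, content: bytes) -> str:
        # The converter expects a file path, so spill the bytes to a temp file.
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text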

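Advantages: prepare() loads the converter models once instead of on every call, gpu=True marks the operator as GPU-bound, the parsing backend can be swapped without touching the rest of the flow, and cache=True lets unchanged PDFs reuse earlier results. Bump behavior_version whenever the conversion logic changes so that cached output is recomputed.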
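The interplay between caching and a behavior version can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not how CocoIndex implements its cache: `CachedConverter`, the shared dict, and the lambda converters are all hypothetical.

```python
import hashlib


class CachedConverter:
    """Memoize an expensive bytes -> str conversion (illustrative only)."""

    def __init__(self, convert, behavior_version, cache):
        self._convert = convert
        self._version = behavior_version
        self._cache = cache  # shared dict, so entries survive across instances
        self.calls = 0       # number of real conversions performed

    def __call__(self, content: bytes) -> str:
        # Key on version + content hash: same bytes and same logic hit the cache.
        key = (self._version, hashlib.sha256(content).hexdigest())
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._convert(content)
        return self._cache[key]


shared_cache = {}
v1 = CachedConverter(lambda b: b.decode().upper(), behavior_version=1, cache=shared_cache)
v1(b"pdf bytes")
v1(b"pdf bytes")  # served from the cache, no second conversion

# New logic with a bumped version: old entries no longer match, so it recomputes.
v2 = CachedConverter(lambda b: b.decode().title(), behavior_version=2, cache=shared_cache)
out_v2 = v2(b"pdf bytes")
print(v1.calls, v2.calls, out_v2)
```

Without the version in the key, the second converter would have silently returned output produced by the old logic.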

Step 3: Pass Through LLM for Structured Data Extraction

Define your schema with dataclasses, then extract info using ExtractByLlm:

import dataclasses

@dataclasses.dataclass
class MethodInfo:
    name: str
    args: cocoindex.typing.List[str]
    description: str

@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: cocoindex.typing.List[ClassInfo]
    methods: cocoindex.typing.List[MethodInfo]

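# Inside manual_extraction_flow:
with data_scope["documents"].row() as doc:
    # Convert the PDF bytes to markdown first; the LLM needs text, not binary content.
    doc["markdown"] = doc["content"].transform(PdfToMarkdown())
    doc["module_info"] = doc["markdown"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(api_type=cocoindex.LlmApiType.OLLAMA, model="llama3.2"),
            output_type=ModuleInfo,
            instruction="Please extract Python module information from the manual.",
        )
    )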
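To see why a typed schema is useful, here is a standalone sketch that turns JSON text (the kind of output a schema-constrained LLM might return) into the nested dataclasses. It is not what ExtractByLlm does internally: `parse_module_info` and the sample payload are hypothetical, and the dataclasses are redeclared with the standard `typing.List` so the snippet runs without CocoIndex.

```python
import dataclasses
import json
from typing import List


@dataclasses.dataclass
class MethodInfo:
    name: str
    args: List[str]
    description: str


@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: List[MethodInfo]


@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: List[ClassInfo]
    methods: List[MethodInfo]


def parse_module_info(raw_json: str) -> ModuleInfo:
    """Build a ModuleInfo; raises KeyError or TypeError if the shape does not match."""
    data = json.loads(raw_json)
    classes = [
        ClassInfo(
            name=c["name"],
            description=c["description"],
            methods=[MethodInfo(**m) for m in c["methods"]],
        )
        for c in data["classes"]
    ]
    return ModuleInfo(
        title=data["title"],
        description=data["description"],
        classes=classes,
        methods=[MethodInfo(**m) for m in data["methods"]],
    )


# Made-up payload for illustration only.
sample = json.dumps({
    "title": "array",
    "description": "Efficient arrays of numeric values.",
    "classes": [{
        "name": "array",
        "description": "Typed sequence.",
        "methods": [{"name": "append", "args": ["x"], "description": "Add an item."}],
    }],
    "methods": [],
})
info = parse_module_info(sample)
print(info.classes[0].methods[0].name)
```

A response that is missing a field fails loudly at parse time instead of surfacing later as a malformed row in the database.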

Step 4: Summarize and Export Data

Add post-processing, collect, and export to SQL:

@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int

@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    # num_methods counts module-level methods only, not those nested in classes.
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

# Inside manual_extraction_flow:
with data_scope["documents"].row() as doc:
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
    # Collect inside the row scope, and include the summary so it gets exported.
    modules_index.collect(
        filename=doc["filename"],
        module_info=doc["module_info"],
        module_summary=doc["module_summary"],
    )

modules_index.export("modules", cocoindex.storages.Postgres(table_name="modules_info"), primary_key_fields=["filename"])
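The role of the primary key is easiest to see with a small experiment. The sketch below uses SQLite from the standard library as a stand-in for Postgres, and `export_rows` is a hypothetical helper rather than anything CocoIndex generates; the sample row is made up.

```python
import json
import sqlite3


def export_rows(conn, rows):
    """Upsert rows keyed by filename; nested values are stored as JSON text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS modules_info ("
        "filename TEXT PRIMARY KEY, module_info TEXT, module_summary TEXT)"
    )
    # SQLite shorthand; Postgres would use INSERT ... ON CONFLICT (filename) DO UPDATE.
    conn.executemany(
        "INSERT OR REPLACE INTO modules_info VALUES (?, ?, ?)",
        [
            (r["filename"], json.dumps(r["module_info"]), json.dumps(r["module_summary"]))
            for r in rows
        ],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
row = {
    "filename": "array.pdf",
    "module_info": {"title": "array", "classes": [], "methods": []},
    "module_summary": {"num_classes": 0, "num_methods": 0},
}
export_rows(conn, [row])
export_rows(conn, [row])  # second run overwrites instead of duplicating
stored = conn.execute("SELECT filename, module_summary FROM modules_info").fetchall()
print(stored)
```

Because the key is the filename, re-running the export for the same document replaces its row, which is what keeps repeated updates from piling up duplicates.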

Scale to any number of docs with a single update command:

cocoindex update -L main

Why CocoIndex Is Handy for LLM Workflows

Typed contracts for prompt engineering and schema evolution
Modular operator graph (e.g., plug in new LLMs or tools seamlessly)
Pluggable LLM endpoints (on-prem/cloud)
Indexed, versioned data: easy debugging, live regression testing
Native support for document retrieval and hybrid agent workflows
Full MLOps support: audit logs, CI/CD, rollback, and analytics

Complete Example: Query and Dashboard

Once indexed, all module summaries are available via SQL:

SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;

For visual dashboards and data lineage, try CocoInsight.
