Overview
We'll demonstrate an end-to-end data extraction pipeline, engineered for full automation, reproducibility, and clarity. Our objective: transform unstructured PDF documentation (like Python's official manuals) into precise, structured, and queryable tables using CocoIndex, Ollama, and state-of-the-art LLMs.
Flow Architecture
- Document Parsing: For each PDF file, the pipeline uses a custom parser to convert binary content to markdown, leveraging CPU or GPU acceleration for high throughput.
- Structured Data Extraction: With the built-in ExtractByLlm function from CocoIndex, markdown is processed by an LLM running locally (like Meta's Llama 3 via Ollama), yielding structured Python dataclass outputs (ModuleInfo).
- Post-Processing & Summarization: The pipeline applies a custom summarization operator, counting and describing structure (number of classes, methods, etc.).
- Data Collection & Export: All structured outputs are indexed in PostgreSQL, powering downstream analytics and fast queries.
This extensible approach supports a variety of schemas, formats, and even alternative LLMs through CocoIndex's modular operator system.
Prerequisites and Environment Setup
Pull the model used in this walkthrough with Ollama:
ollama pull llama3.2
CocoIndex supports Ollama, Gemini, and LiteLLM for on-prem and hybrid cloud use cases.
Step 1: Add a Source (Python PDFs as Input)
Register the "manuals" directory as a binary source:
@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="manuals", binary=True)
    )
    modules_index = data_scope.add_collector()
Each source file is automatically indexed as:
- filename: str (for traceability)
- content: bytes (for lossless parsing)
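To make the row shape concrete, here is a minimal standard-library sketch of what reading a directory in binary mode amounts to. The helper `read_binary_rows` is hypothetical and only imitates the (filename, content) shape described above; it is not how CocoIndex implements its LocalFile source.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_binary_rows(root: str, suffix: str = ".pdf") -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, content) pairs for every matching file under root."""
    base = Path(root)
    for path in sorted(base.rglob(f"*{suffix}")):
        if path.is_file():
            # The filename is kept relative to the source root for traceability;
            # the content stays as raw bytes so nothing is lost before parsing.
            yield path.relative_to(base).as_posix(), path.read_bytes()
```

Calling `list(read_binary_rows("manuals"))` would give one tuple per PDF, which is roughly the information each row carries into the next step.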
Step 2: Convert PDF to Markdown with a Custom Operator
Define a function/executor that handles PDF-to-markdown transformation (pluggable for custom or high-performance parsing):
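import tempfile

import cocoindex
# Assumption: the converter classes below come from the marker-pdf package.
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        # Runs once per executor: load the models and reuse them for every file.
        config_parser = ConfigParser({})
        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())

    def __call__(self, content: bytes) -> str:
        # The converter takes a file path, so write the bytes to a temporary file.
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text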
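Advantages: the converter is initialized once in prepare() and reused across files, the executor can run on a GPU (gpu=True), the parsing backend can be swapped without touching the rest of the flow, and results are cached (cache=True) until behavior_version is bumped.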
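The interplay between caching and behavior_version can be illustrated with a small in-memory memoizer. This is only a sketch of the idea: `cached_call`, `fake_convert`, and the dictionary store are hypothetical stand-ins, not CocoIndex's actual cache.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

CacheStore = Dict[Tuple[int, str], str]


def cached_call(store: CacheStore, fn: Callable[[bytes], str],
                behavior_version: int, content: bytes) -> str:
    """Reuse a previous result when both the input bytes and the version match."""
    key = (behavior_version, hashlib.sha256(content).hexdigest())
    if key not in store:
        store[key] = fn(content)
    return store[key]


calls: List[bytes] = []


def fake_convert(content: bytes) -> str:
    # Stand-in for a slow PDF-to-markdown conversion; records each real run.
    calls.append(content)
    return f"# {len(content)} bytes"


store: CacheStore = {}
cached_call(store, fake_convert, 1, b"same pdf")
cached_call(store, fake_convert, 1, b"same pdf")  # cache hit, no recompute
cached_call(store, fake_convert, 2, b"same pdf")  # new version forces a recompute
print(len(calls))  # 2
```

Keying on the version as well as the content is what lets a change in conversion logic invalidate old results without clearing the whole cache.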
Step 3: Pass Through LLM for Structured Data Extraction
Define your schema with dataclasses, then extract info using ExtractByLlm:
@dataclasses.dataclass
class MethodInfo:
    name: str
    args: cocoindex.typing.List[str]
    description: str

@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: cocoindex.typing.List[ClassInfo]
    methods: cocoindex.typing.List[MethodInfo]

# Inside manual_extraction_flow:
    with data_scope["documents"].row() as doc:
        # Convert the PDF bytes to markdown first (Step 2), then extract from the markdown.
        doc["markdown"] = doc["content"].transform(PdfToMarkdown())
        doc["module_info"] = doc["markdown"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(api_type=cocoindex.LlmApiType.OLLAMA, model="llama3.2"),
                output_type=ModuleInfo,
                instruction="Please extract Python module information from the manual.")
        )
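Structured extraction depends on the model returning data that fits the dataclass schema. As a rough illustration of that contract, the sketch below rebuilds nested dataclasses from a JSON payload of the kind a model might return. The `from_dict` helper and the sample payload are hypothetical, and the schema uses `typing.List` so the snippet runs without CocoIndex; it does not show how ExtractByLlm works internally.

```python
import dataclasses
import json
from typing import Any, List, get_args, get_origin, get_type_hints


@dataclasses.dataclass
class MethodInfo:
    name: str
    args: List[str]
    description: str


@dataclasses.dataclass
class ClassInfo:
    name: str
    description: str
    methods: List[MethodInfo]


@dataclasses.dataclass
class ModuleInfo:
    title: str
    description: str
    classes: List[ClassInfo]
    methods: List[MethodInfo]


def from_dict(tp: Any, value: Any) -> Any:
    """Recursively rebuild dataclasses and lists from plain JSON values."""
    if dataclasses.is_dataclass(tp):
        hints = get_type_hints(tp)
        # A missing field raises KeyError, surfacing schema violations early.
        return tp(**{name: from_dict(hints[name], value[name]) for name in hints})
    if get_origin(tp) is list:
        (item_type,) = get_args(tp)
        return [from_dict(item_type, item) for item in value]
    return value


# Invented payload, shaped like the schema above.
raw = json.dumps({
    "title": "array",
    "description": "Efficient arrays of numeric values.",
    "classes": [{
        "name": "array",
        "description": "Typed sequence.",
        "methods": [{"name": "append", "args": ["x"], "description": "Append an item."}],
    }],
    "methods": [],
})
module = from_dict(ModuleInfo, json.loads(raw))
print(module.classes[0].methods[0].name)  # append
```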
Step 4: Summarize and Export Data
Add post-processing, collect, and export to SQL:
@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int

@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

# Inside manual_extraction_flow:
    with data_scope["documents"].row() as doc:
        doc["module_summary"] = doc["module_info"].transform(summarize_module)
        modules_index.collect(
            filename=doc["filename"],
            module_info=doc["module_info"],
            module_summary=doc["module_summary"],  # exported so it can be queried alongside module_info
        )

    modules_index.export("modules", cocoindex.storages.Postgres(table_name="modules_info"), primary_key_fields=["filename"])
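The counting rule is easy to check in isolation. The sketch below runs the same logic on a hand-written sample, without the CocoIndex decorator and with trimmed-down dataclasses. The `count_all_methods` variant and the sample data are hypothetical additions, included only to show that num_methods covers module-level methods and not those nested in classes.

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class MethodInfo:
    name: str


@dataclasses.dataclass
class ClassInfo:
    name: str
    methods: List[MethodInfo]


@dataclasses.dataclass
class ModuleInfo:
    title: str
    classes: List[ClassInfo]
    methods: List[MethodInfo]


@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int


def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    # Same counting rule as the flow: only module-level methods are counted.
    return ModuleSummary(len(module_info.classes), len(module_info.methods))


def count_all_methods(module_info: ModuleInfo) -> int:
    # Hypothetical variant: also count the methods nested inside each class.
    nested = sum(len(c.methods) for c in module_info.classes)
    return len(module_info.methods) + nested


sample = ModuleInfo(
    title="array",
    classes=[ClassInfo("array", [MethodInfo("append"), MethodInfo("extend")])],
    methods=[MethodInfo("typecodes")],
)
print(summarize_module(sample))   # ModuleSummary(num_classes=1, num_methods=1)
print(count_all_methods(sample))  # 3
```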
Scale to any number of docs with a single update command:
cocoindex update -L main
Why CocoIndex Is Handy for LLM Workflows
- Typed contracts for prompt engineering and schema evolution
- Modular operator graph (e.g., plug in new LLMs or tools seamlessly)
- Pluggable LLM endpoints (on-prem/cloud)
- Indexed, versioned data: easy debugging, live regression testing
- Native support for document retrieval and hybrid agent workflows
- Full MLOps support: audit logs, CI/CD, rollback, and analytics
Complete Example: Query and Dashboard
Once indexed, all module summaries are available via SQL:
SELECT filename, module_info->>'title' AS title, module_summary FROM modules_info;
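The same query can be run from Python. This is a sketch under several assumptions: the third-party psycopg (version 3) driver is installed, the struct columns are stored as JSON so the driver returns them as dictionaries, and the connection string is whatever your PostgreSQL instance uses. `fetch_rows` and `format_rows` are hypothetical helper names.

```python
from typing import Any, Iterable, List, Mapping, Tuple

QUERY = """
SELECT filename, module_info->>'title' AS title, module_summary
FROM modules_info
"""

Row = Tuple[str, str, Mapping[str, Any]]


def fetch_rows(dsn: str) -> List[Row]:
    # psycopg is imported lazily so format_rows can be used without the driver.
    import psycopg

    with psycopg.connect(dsn) as conn:
        return conn.execute(QUERY).fetchall()


def format_rows(rows: Iterable[Row]) -> List[str]:
    """Render one line per module, largest class count first."""
    ordered = sorted(rows, key=lambda r: r[2]["num_classes"], reverse=True)
    return [
        f"{filename}: {title} ({summary['num_classes']} classes, "
        f"{summary['num_methods']} methods)"
        for filename, title, summary in ordered
    ]
```

With a database available, `format_rows(fetch_rows(dsn))` would produce a quick text report; separating fetching from formatting keeps the formatting step testable on plain tuples.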
For visual dashboards and data lineage, try CocoInsight.