Linghua Jin

Posted on Jan 13

Stop Writing Fragile Prompts: Extract Structured Data from PDFs with DSPy + CocoIndex

TL;DR: Hand-written prompts are fragile: a small change to the model or the input data can break the output format. This tutorial shows how to extract structured patient data from PDF intake forms using DSPy's typed Signatures and CocoIndex's incremental processing. No OCR preprocessing, no regex, just declarative code.

The Problem with Prompt Engineering

If you've built LLM applications, you know the pain:

You write a carefully crafted prompt with instructions and few-shot examples
It works... until the model changes, or the data shifts slightly
Your output format breaks, and you're back to debugging strings

Logic buried in strings is hard to test, compose, or version.

What if there was a better way?

Enter DSPy + CocoIndex

DSPy (from Stanford) replaces prompt engineering with a programming model. You define what each LLM step should do (inputs, outputs, constraints), and the framework figures out how to prompt the model.

CocoIndex is a Rust-powered data processing engine that handles:

File ingestion from sources such as a local directory
Incremental processing (only reprocess changed documents)
Caching and lineage tracking

Together, they form a powerful combo: DSPy owns "how the model thinks," CocoIndex owns "how data moves and stays fresh."

What We're Building

A pipeline that:

Reads PDF patient intake forms
Converts pages to images
Extracts structured Patient data using vision models
Exports to PostgreSQL with automatic updates

Step 1: Define Your Schema with Pydantic

Instead of parsing unstructured text, we define exactly what we want:

from pydantic import BaseModel, Field
from datetime import date

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Insurance(BaseModel):
    provider: str
    policy_number: str
    group_number: str | None = None
    policyholder_name: str

class Patient(BaseModel):
    name: str
    dob: date
    gender: str
    address: Address
    phone: str
    email: str
    insurance: Insurance | None = None
    reason_for_visit: str
    allergies: list[str] = Field(default_factory=list)
    current_medications: list[str] = Field(default_factory=list)
    consent_given: bool

This FHIR-inspired schema gives us validation, nested models, and type safety out of the box.
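To see what "validation out of the box" means in practice, here is a minimal sketch, assuming Pydantic v2 (`model_validate`). It uses a trimmed-down copy of the models above, and the `raw` payload is a made-up example of the kind of JSON a model might return.

```python
from datetime import date

from pydantic import BaseModel, Field, ValidationError


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Patient(BaseModel):
    # Trimmed-down version of the tutorial's Patient model, for illustration only.
    name: str
    dob: date
    address: Address
    allergies: list[str] = Field(default_factory=list)
    consent_given: bool


# Hypothetical raw payload, shaped like JSON a model might return.
raw = {
    "name": "Jane Doe",
    "dob": "1985-03-14",  # ISO date string is coerced to datetime.date
    "address": {
        "street": "1 Main St",
        "city": "Springfield",
        "state": "IL",
        "zip_code": "62701",
    },
    "consent_given": True,
}

patient = Patient.model_validate(raw)

# A payload missing required fields is rejected instead of passing through silently.
try:
    Patient.model_validate({"name": "No Consent"})
    missing_fields = []
except ValidationError as exc:
    missing_fields = sorted(str(err["loc"][0]) for err in exc.errors())
```

The nested `address` dict becomes an `Address` instance, the optional list falls back to an empty list, and the incomplete payload fails with one error per missing field.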

Step 2: Create the DSPy Signature

A Signature is DSPy's way of declaring the task contract:

import dspy

class PatientExtractionSignature(dspy.Signature):
    """Extract structured patient information from a medical intake form image."""

    form_images: list[dspy.Image] = dspy.InputField(
        desc="Images of the patient intake form pages"
    )
    patient: Patient = dspy.OutputField(
        desc="Extracted patient information with all available fields filled"
    )

Notice: No prompts, no examples, no parsing logic. Just a typed contract.

Step 3: Build the Extractor Module

class PatientExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought(PatientExtractionSignature)

    def forward(self, form_images: list[dspy.Image]) -> Patient:
        result = self.extract(form_images=form_images)
        return result.patient

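ChainOfThought asks the model to write out its reasoning before the final answer. DSPy builds the prompt from the signature's docstring, field descriptions, and types, so there is no prompt string for you to maintain.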

Step 4: Wire It Into CocoIndex

Here's where the incremental processing magic happens:

import cocoindex
import pymupdf

@cocoindex.op.function(cache=True, behavior_version=1)
def extract_patient(pdf_content: bytes) -> Patient:
    """Render each PDF page to a PNG and extract a Patient record with DSPy."""
    form_images: list[dspy.Image] = []

    # The context manager closes the document even if rendering raises.
    with pymupdf.open(stream=pdf_content, filetype="pdf") as pdf_doc:
        for page in pdf_doc:
            pix = page.get_pixmap(matrix=pymupdf.Matrix(2, 2))  # 2x resolution
            form_images.append(dspy.Image(pix.tobytes("png")))

    # Extract with DSPy
    extractor = PatientExtractor()
    return extractor(form_images=form_images)

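The cache=True argument means repeated calls with the same PDF bytes reuse the stored result, so no API calls are wasted. Bump behavior_version whenever you change the function's logic so that stale cached results are recomputed.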
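The idea behind a content-keyed cache can be sketched in plain Python. This is a conceptual illustration only, not how CocoIndex implements it: `cached_call`, `fake_extract`, and the in-memory `_cache` dict are hypothetical names, and CocoIndex persists its cache rather than holding it in memory.

```python
import hashlib
from typing import Callable

# Hypothetical in-memory store keyed by (content hash, behavior version).
_cache: dict[tuple[str, int], str] = {}


def cached_call(fn: Callable[[bytes], str], content: bytes, behavior_version: int) -> str:
    """Reuse a stored result when the same bytes were seen under the same version."""
    key = (hashlib.sha256(content).hexdigest(), behavior_version)
    if key not in _cache:
        _cache[key] = fn(content)  # the expensive step, e.g. an LLM call
    return _cache[key]


calls: list[bytes] = []


def fake_extract(content: bytes) -> str:
    """Stand-in for the real extractor; records each invocation."""
    calls.append(content)
    return f"{len(content)} bytes processed"


cached_call(fake_extract, b"%PDF-1.7 form A", behavior_version=1)
cached_call(fake_extract, b"%PDF-1.7 form A", behavior_version=1)  # cache hit
cached_call(fake_extract, b"%PDF-1.7 form A", behavior_version=2)  # version bump: recompute
cached_call(fake_extract, b"%PDF-1.7 form B", behavior_version=2)  # new content: recompute
```

Four calls trigger only three real extractions: the second call is served from the cache, while a new version number or new bytes produce a different key.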

Step 5: Define the Flow

@cocoindex.flow_def(name="PatientIntakeExtraction")
def patient_intake_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Source: local PDF files
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="data/patient_forms", binary=True)
    )

    patients_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["patient_info"] = doc["content"].transform(extract_patient)
        patients_index.collect(
            filename=doc["filename"],
            patient_info=doc["patient_info"]
        )

    # Export to Postgres
    patients_index.export(
        "patients",
        cocoindex.storages.Postgres(table_name="patients_info"),
        primary_key_fields=["filename"]
    )
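To make "incremental" concrete, here is a rough plain-Python sketch of the bookkeeping involved when rows are keyed by filename. It is an assumption-laden illustration, not CocoIndex internals: `fingerprint` and `plan_update` are hypothetical helpers, and the file contents are placeholder bytes.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to detect whether a file changed since the last run."""
    return hashlib.sha256(content).hexdigest()


def plan_update(
    previous: dict[str, str], current_files: dict[str, bytes]
) -> dict[str, list[str]]:
    """Compare stored fingerprints with the current files and decide what to do per filename."""
    current = {name: fingerprint(data) for name, data in current_files.items()}
    return {
        # New or modified files need extraction and an upsert keyed by filename.
        "process": sorted(n for n, fp in current.items() if previous.get(n) != fp),
        # Unchanged files are skipped entirely.
        "skip": sorted(n for n, fp in current.items() if previous.get(n) == fp),
        # Files that disappeared should have their rows removed from the target table.
        "delete": sorted(n for n in previous if n not in current),
    }


# Hypothetical state from an earlier run.
first_run = {"a.pdf": b"form A v1", "b.pdf": b"form B", "c.pdf": b"form C"}
state = {name: fingerprint(data) for name, data in first_run.items()}

# Second run: a.pdf was edited, d.pdf was added, c.pdf was removed.
second_run = {"a.pdf": b"form A v2", "b.pdf": b"form B", "d.pdf": b"form D"}
plan = plan_update(state, second_run)
```

Only `a.pdf` and `d.pdf` would reach the extractor on the second run; with the flow above you get this tracking without writing it yourself.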

Why This Approach Wins

Traditional Approach	DSPy + CocoIndex
Fragile prompts	Typed Signatures
Manual parsing	Automatic structured output
Full reprocessing	Incremental updates
No audit trail	Built-in lineage tracking
String debugging	Testable modules

Run It

pip install cocoindex dspy-ai pydantic pymupdf
cocoindex update main

That's it. Your PDFs are now structured rows in Postgres, and re-running the update reprocesses only the PDFs that changed.

Key Takeaways

DSPy replaces prompt engineering with a programming model—define the contract, not the implementation
Vision models eliminate OCR complexity—just pass images directly
CocoIndex handles the plumbing—caching, incremental updates, lineage
Pydantic gives you validation and nested structures for free
Neither tool tries to be the whole stack—they compose beautifully

Have you tried DSPy or CocoIndex? Drop a comment with your experience—I'd love to hear how others are solving structured extraction problems!
