Linghua Jin

Stop Writing Fragile Prompts: Extract Structured Data from PDFs with DSPy + CocoIndex

TL;DR: Hand-written prompts are fragile: a model upgrade or a small shift in the input can break the output format. This tutorial shows how to extract structured patient data from PDF intake forms using DSPy's typed Signatures and CocoIndex's incremental processing. No OCR preprocessing, no regex, just declarative code.


The Problem with Prompt Engineering

If you've built LLM applications, you know the pain:

  • You write a carefully crafted prompt with instructions and few-shot examples
  • It works... until the model changes, or the data shifts slightly
  • Your output format breaks, and you're back to debugging strings

Logic buried in strings is hard to test, compose, or version.

What if there was a better way?


Enter DSPy + CocoIndex

DSPy (from Stanford) replaces prompt engineering with a programming model. You define what each LLM step should do (inputs, outputs, constraints), and the framework figures out how to prompt the model.

CocoIndex is a data processing engine with a Rust core that handles:

  • File ingestion from a range of sources (this tutorial uses local files)
  • Incremental processing (only reprocess changed documents)
  • Caching and lineage tracking

Together, they form a powerful combo: DSPy owns "how the model thinks," CocoIndex owns "how data moves and stays fresh."
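To make "only reprocess changed documents" concrete, here is a minimal sketch of content-based change detection using only the standard library. It illustrates the general idea, not CocoIndex's internals; `fingerprint` and `plan_update` are hypothetical names made up for this example.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable content hash for one document."""
    return hashlib.sha256(content).hexdigest()


def plan_update(previous: dict[str, str], current_files: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare last run's fingerprints with the files seen now.

    previous      maps filename -> fingerprint stored by the last run
    current_files maps filename -> raw bytes found in this run
    """
    current = {name: fingerprint(data) for name, data in current_files.items()}
    return {
        # New files and files whose bytes changed need (re)processing.
        "process": sorted(n for n, fp in current.items() if previous.get(n) != fp),
        # Unchanged files keep their existing results.
        "skip": sorted(n for n, fp in current.items() if previous.get(n) == fp),
        # Files that disappeared should have their rows removed.
        "delete": sorted(n for n in previous if n not in current),
    }


last_run = {
    "a.pdf": fingerprint(b"form A v1"),
    "b.pdf": fingerprint(b"form B"),
    "old.pdf": fingerprint(b"removed form"),
}
this_run = {"a.pdf": b"form A v2", "b.pdf": b"form B", "c.pdf": b"form C"}

plan = plan_update(last_run, this_run)
print(plan)
```

Only `a.pdf` (changed) and `c.pdf` (new) would trigger an LLM call here; `b.pdf` is left alone.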


What We're Building

A pipeline that:

  1. Reads PDF patient intake forms
  2. Converts pages to images
  3. Extracts structured Patient data using vision models
  4. Exports to PostgreSQL with automatic updates

Step 1: Define Your Schema with Pydantic

Instead of parsing unstructured text, we define exactly what we want:

from pydantic import BaseModel, Field
from datetime import date

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Insurance(BaseModel):
    provider: str
    policy_number: str
    group_number: str | None = None
    policyholder_name: str

class Patient(BaseModel):
    name: str
    dob: date
    gender: str
    address: Address
    phone: str
    email: str
    insurance: Insurance | None = None
    reason_for_visit: str
    allergies: list[str] = Field(default_factory=list)
    current_medications: list[str] = Field(default_factory=list)
    consent_given: bool

This FHIR-inspired schema gives us validation, nested models, and type safety out of the box.
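A quick way to see what that validation buys you is to feed a model a plain dict, the shape an LLM's JSON output would take. The sketch below assumes Pydantic v2 (`model_validate`) and uses `PatientLite`, a trimmed-down, hypothetical stand-in for the full `Patient` model, so that it runs on its own. The sample values are made up.

```python
from datetime import date

from pydantic import BaseModel, Field, ValidationError


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class PatientLite(BaseModel):
    # Hypothetical subset of the Patient fields, for illustration only.
    name: str
    dob: date
    address: Address
    allergies: list[str] = Field(default_factory=list)


raw = {
    "name": "Jane Doe",
    "dob": "1990-04-12",  # ISO date string is coerced to datetime.date
    "address": {
        "street": "1 Main St",
        "city": "Springfield",
        "state": "IL",
        "zip_code": "62701",
    },
}

patient = PatientLite.model_validate(raw)
print(patient.dob, patient.address.city, patient.allergies)

# A malformed value is rejected and the error names the offending field.
try:
    PatientLite.model_validate({**raw, "dob": "next Tuesday"})
    bad_fields = []
except ValidationError as exc:
    bad_fields = [err["loc"] for err in exc.errors()]
print(bad_fields)
```

The nested `address` dict becomes an `Address` instance, the missing `allergies` key falls back to an empty list, and the bad date is reported against `dob` instead of slipping through as a string.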


Step 2: Create the DSPy Signature

A Signature is DSPy's way of declaring the task contract:

import dspy

class PatientExtractionSignature(dspy.Signature):
    """Extract structured patient information from a medical intake form image."""

    form_images: list[dspy.Image] = dspy.InputField(
        desc="Images of the patient intake form pages"
    )
    patient: Patient = dspy.OutputField(
        desc="Extracted patient information with all available fields filled"
    )

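Notice: no hand-written prompt template, no few-shot examples, no parsing logic. The docstring and the field descriptions are the only natural-language hints; everything else is a typed contract.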


Step 3: Build the Extractor Module

class PatientExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought(PatientExtractionSignature)

    def forward(self, form_images: list[dspy.Image]) -> Patient:
        result = self.extract(form_images=form_images)
        return result.patient

ChainOfThought asks the model to write out its reasoning before it fills in the patient field. DSPy turns the signature into the actual prompt, so you never write one by hand.


Step 4: Wire It Into CocoIndex

Here's where the incremental processing comes in:

import cocoindex
import pymupdf

@cocoindex.op.function(cache=True, behavior_version=1)
def extract_patient(pdf_content: bytes) -> Patient:
    # Render each PDF page to a PNG image
    form_images = []

    # The context manager closes the document even if rendering fails
    with pymupdf.open(stream=pdf_content, filetype="pdf") as pdf_doc:
        for page in pdf_doc:
            pix = page.get_pixmap(matrix=pymupdf.Matrix(2, 2))  # 2x zoom (144 dpi instead of 72)
            form_images.append(dspy.Image(pix.tobytes("png")))

    # Extract with DSPy
    extractor = PatientExtractor()
    return extractor(form_images=form_images)

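The cache=True argument means repeated calls with the same PDF content reuse the stored result, so no API calls are wasted. Bump behavior_version whenever you change the extraction logic; that tells CocoIndex the cached results are stale and must be recomputed.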
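The interplay between caching and behavior_version is easier to see in a toy version. The sketch below is a standard-library illustration of memoizing on (function name, version, content hash). It is not how CocoIndex implements its cache, and `cached`, `expensive_extract`, and `stats` are hypothetical names.

```python
import hashlib
from typing import Callable

_cache: dict[tuple[str, int, str], object] = {}
stats = {"calls": 0}  # counts how often the expensive function really runs


def cached(behavior_version: int) -> Callable:
    """Memoize a bytes -> result function on (name, version, content hash)."""
    def decorator(fn: Callable[[bytes], object]) -> Callable[[bytes], object]:
        def wrapper(content: bytes) -> object:
            digest = hashlib.sha256(content).hexdigest()
            key = (fn.__name__, behavior_version, digest)
            if key not in _cache:
                _cache[key] = fn(content)
            return _cache[key]
        return wrapper
    return decorator


def expensive_extract(content: bytes) -> int:
    stats["calls"] += 1  # stands in for a paid LLM call
    return len(content)


extract_v1 = cached(behavior_version=1)(expensive_extract)
extract_v1(b"pdf bytes")   # computed
extract_v1(b"pdf bytes")   # same content, same version: served from cache

extract_v2 = cached(behavior_version=2)(expensive_extract)
extract_v2(b"pdf bytes")   # version bumped: computed again

print(stats["calls"])
```

The second call costs nothing, while the version bump forces one fresh computation even though the bytes are identical.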


Step 5: Define the Flow

@cocoindex.flow_def(name="PatientIntakeExtraction")
def patient_intake_flow(flow_builder, data_scope):
    # Source: local PDF files
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="data/patient_forms", binary=True)
    )

    patients_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["patient_info"] = doc["content"].transform(extract_patient)
        patients_index.collect(
            filename=doc["filename"],
            patient_info=doc["patient_info"]
        )

    # Export to Postgres
    patients_index.export(
        "patients",
        cocoindex.targets.Postgres(table_name="patients_info"),  # older releases: cocoindex.storages.Postgres
        primary_key_fields=["filename"]
    )

Why This Approach Wins

Traditional Approach | DSPy + CocoIndex
Fragile prompts      | Typed Signatures
Manual parsing       | Automatic structured output
Full reprocessing    | Incremental updates
No audit trail       | Built-in lineage tracking
String debugging     | Testable modules

Run It

pip install cocoindex dspy-ai pydantic pymupdf
cocoindex update main

That's it. Your PDFs are now structured rows in Postgres. Run the command again after adding or changing forms, and only the affected files are reprocessed.
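Because the export is keyed on filename, a reprocessed PDF replaces its existing row instead of adding a second one. The sketch below illustrates that upsert behaviour with SQLite from the standard library standing in for Postgres. The two-column layout and the JSON text column are assumptions for the example, not the table CocoIndex actually creates.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients_info (filename TEXT PRIMARY KEY, patient_info TEXT)"
)


def upsert(filename: str, patient_info: dict) -> None:
    """Insert a row, or overwrite the existing row with the same filename."""
    conn.execute(
        """
        INSERT INTO patients_info (filename, patient_info) VALUES (?, ?)
        ON CONFLICT(filename) DO UPDATE SET patient_info = excluded.patient_info
        """,
        (filename, json.dumps(patient_info)),
    )


upsert("jane.pdf", {"name": "Jane Doe", "phone": "555-0100"})
upsert("jane.pdf", {"name": "Jane Doe", "phone": "555-0199"})  # form was re-scanned

rows = conn.execute("SELECT filename, patient_info FROM patients_info").fetchall()
latest = json.loads(rows[0][1])
print(len(rows), latest["phone"])
```

One file, one row: the table always reflects the latest extraction for each form.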


Key Takeaways

  1. DSPy replaces prompt engineering with a programming model—define the contract, not the implementation
  2. Vision models eliminate OCR complexity—just pass images directly
  3. CocoIndex handles the plumbing—caching, incremental updates, lineage
  4. Pydantic gives you validation and nested structures for free
  5. Neither tool tries to be the whole stack—they compose beautifully


Have you tried DSPy or CocoIndex? Drop a comment with your experience—I'd love to hear how others are solving structured extraction problems!
