**TL;DR:** Traditional prompt engineering is fragile—small changes break everything. This tutorial shows how to extract structured patient data from PDF intake forms using DSPy's typed Signatures + CocoIndex's incremental processing. No OCR preprocessing, no regex, just declarative code.
## The Problem with Prompt Engineering
If you've built LLM applications, you know the pain:
- You write a carefully crafted prompt with instructions and few-shot examples
- It works... until the model changes, or the data shifts slightly
- Your output format breaks, and you're back to debugging strings
Logic buried in strings is hard to test, compose, or version.
What if there was a better way?
## Enter DSPy + CocoIndex
DSPy (from Stanford) replaces prompt engineering with a programming model. You define what each LLM step should do (inputs, outputs, constraints), and the framework figures out how to prompt the model.
CocoIndex is an ultra-performant data processing engine (Rust-powered) that handles:
- File ingestion from any source
- Incremental processing (only reprocess changed documents)
- Caching and lineage tracking
Together, they form a powerful combo: DSPy owns "how the model thinks," CocoIndex owns "how data moves and stays fresh."
## What We're Building
A pipeline that:
- Reads PDF patient intake forms
- Converts pages to images
- Extracts structured `Patient` data using vision models
- Exports to PostgreSQL with automatic updates
## Step 1: Define Your Schema with Pydantic
Instead of parsing unstructured text, we define exactly what we want:
```python
from pydantic import BaseModel, Field
from datetime import date


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Insurance(BaseModel):
    provider: str
    policy_number: str
    group_number: str | None = None
    policyholder_name: str


class Patient(BaseModel):
    name: str
    dob: date
    gender: str
    address: Address
    phone: str
    email: str
    insurance: Insurance | None = None
    reason_for_visit: str
    allergies: list[str] = Field(default_factory=list)
    current_medications: list[str] = Field(default_factory=list)
    consent_given: bool
```
This FHIR-inspired schema gives us validation, nested models, and type safety out of the box.
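To see that validation in action, here's a quick sketch using a trimmed-down version of the model (field names match the schema above; the sample values are made up):

```python
from datetime import date

from pydantic import BaseModel, Field, ValidationError


class Patient(BaseModel):  # trimmed to a few fields for illustration
    name: str
    dob: date
    allergies: list[str] = Field(default_factory=list)
    consent_given: bool


# Pydantic coerces the ISO date string and applies defaults
p = Patient.model_validate(
    {"name": "Jane Doe", "dob": "1990-04-12", "consent_given": True}
)
print(p.dob)        # a real datetime.date object
print(p.allergies)  # [] from the default_factory

# Missing required fields fail loudly instead of producing silent garbage
try:
    Patient.model_validate({"name": "Jane Doe"})
except ValidationError as exc:
    print(f"{len(exc.errors())} validation errors")  # dob and consent_given
```

The same errors surface if the model returns malformed output, so bad extractions are caught at the boundary rather than downstream in Postgres.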
## Step 2: Create the DSPy Signature
A Signature is DSPy's way of declaring the task contract:
```python
import dspy


class PatientExtractionSignature(dspy.Signature):
    """Extract structured patient information from a medical intake form image."""

    form_images: list[dspy.Image] = dspy.InputField(
        desc="Images of the patient intake form pages"
    )
    patient: Patient = dspy.OutputField(
        desc="Extracted patient information with all available fields filled"
    )
```
Notice: No prompts, no examples, no parsing logic. Just a typed contract.
## Step 3: Build the Extractor Module
```python
class PatientExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought(PatientExtractionSignature)

    def forward(self, form_images: list[dspy.Image]) -> Patient:
        result = self.extract(form_images=form_images)
        return result.patient
```
`ChainOfThought` handles the reasoning; DSPy translates your signature into an effective prompt automatically.
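Before invoking the module, DSPy needs to be pointed at a vision-capable language model. A minimal configuration sketch (the model name and environment-based API key are assumptions, not part of the original pipeline):

```python
import dspy

# Assumes an API key in the environment (e.g. OPENAI_API_KEY);
# the model name is illustrative, any vision-capable LM works.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

extractor = PatientExtractor()
# patient = extractor(form_images=images)  # returns a validated Patient
```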
## Step 4: Wire It Into CocoIndex
Here's where incremental processing magic happens:
```python
import cocoindex
import pymupdf


@cocoindex.op.function(cache=True, behavior_version=1)
def extract_patient(pdf_content: bytes) -> Patient:
    # Convert PDF pages to images
    pdf_doc = pymupdf.open(stream=pdf_content, filetype="pdf")
    form_images = []
    for page in pdf_doc:
        pix = page.get_pixmap(matrix=pymupdf.Matrix(2, 2))  # render at 2x resolution
        form_images.append(dspy.Image(pix.tobytes("png")))
    pdf_doc.close()

    # Extract with DSPy
    extractor = PatientExtractor()
    return extractor(form_images=form_images)
```
The `cache=True` argument tells CocoIndex to reuse results for repeated calls with the same PDF bytes, so no API calls are wasted on unchanged documents.
## Step 5: Define the Flow
```python
@cocoindex.flow_def(name="PatientIntakeExtraction")
def patient_intake_flow(flow_builder, data_scope):
    # Source: local PDF files
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="data/patient_forms", binary=True)
    )

    patients_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["patient_info"] = doc["content"].transform(extract_patient)
        patients_index.collect(
            filename=doc["filename"],
            patient_info=doc["patient_info"],
        )

    # Export to Postgres
    patients_index.export(
        "patients",
        cocoindex.storages.Postgres(table_name="patients_info"),
        primary_key_fields=["filename"],
    )
```
## Why This Approach Wins
| Traditional Approach | DSPy + CocoIndex |
|---|---|
| Fragile prompts | Typed Signatures |
| Manual parsing | Automatic structured output |
| Full reprocessing | Incremental updates |
| No audit trail | Built-in lineage tracking |
| String debugging | Testable modules |
## Run It
```bash
pip install cocoindex dspy-ai pydantic pymupdf
cocoindex update main
```
That's it. Your PDFs are now structured data in Postgres, automatically kept in sync.
## Key Takeaways
- DSPy replaces prompt engineering with a programming model—define the contract, not the implementation
- Vision models eliminate OCR complexity—just pass images directly
- CocoIndex handles the plumbing—caching, incremental updates, lineage
- Pydantic gives you validation and nested structures for free
- Neither tool tries to be the whole stack—they compose beautifully
Have you tried DSPy or CocoIndex? Drop a comment with your experience—I'd love to hear how others are solving structured extraction problems!