Have you ever looked at a stack of physical medical reports and wished you could just "Ctrl+F" your health history? We've all been there. Every hospital has a different layout, different units, and cryptic abbreviations that make manual data entry a nightmare.
In the world of data engineering, turning messy, unstructured documents into structured data is a superpower. Today, we're going to build a robust pipeline that uses Pydantic, Instructor, and Azure Form Recognizer to transform scanned medical reports into standardized, strictly typed JSON that can follow medical coding standards like LOINC.
Why "Prompting" isn't enough
If you just throw OCR text at an LLM and ask for "JSON," it will eventually fail. It might hallucinate a field, change a data type, or forget a closing bracket. To build production-grade health tech, we need validation.
By combining Pydantic for schema definition and Instructor for steering the LLM, we ensure that the output isn't just "JSON-like"; it's a strictly typed Python object.
The Architecture: From Pixels to Patterns
Here is how the data flows from a blurry JPEG to a clean, queryable database:
```mermaid
graph TD
    A[Scanned Report/Image] -->|OCR Extraction| B[Azure AI Document Intelligence]
    B -->|Raw Text & Tables| C[Instructor + LLM]
    C -->|Schema Enforcement| D[Pydantic Model]
    D -->|Validation Check| E{Is it Valid?}
    E -->|No| C
    E -->|Yes| F[Standardized JSON - LOINC Compatible]
    F -->|Storage| G[PostgreSQL/Vector DB]
```
Step 1: Defining the Medical Schema
First, we define exactly what a "Medical Test" looks like. We want to capture the test name, the result, the unit, and that pesky reference range.
```python
from pydantic import BaseModel, Field
from typing import List, Optional

class MedicalTestResult(BaseModel):
    test_name: str = Field(..., description="The name of the test, e.g., 'Hemoglobin' or 'HbA1c'")
    value: float = Field(..., description="The numerical result of the test")
    unit: str = Field(..., description="The measurement unit, e.g., 'g/dL' or 'mmol/L'")
    is_normal: bool = Field(..., description="Flag indicating if the result is within the reference range")
    reference_range: Optional[str] = Field(None, description="The normal range provided by the lab")

class HealthReport(BaseModel):
    patient_name: Optional[str] = None  # Needs a default: Optional alone is still a required field in Pydantic v2
    report_date: str
    hospital_name: str
    results: List[MedicalTestResult]
```
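Before wiring in any LLM, it's worth sanity-checking the schema against a hand-written payload. The snippet below re-declares the models in compact form (without the `Field` descriptions) so it runs on its own; the sample values are invented:

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class MedicalTestResult(BaseModel):
    test_name: str
    value: float
    unit: str
    is_normal: bool
    reference_range: Optional[str] = None

class HealthReport(BaseModel):
    patient_name: Optional[str] = None
    report_date: str
    hospital_name: str
    results: List[MedicalTestResult]

# A well-formed payload parses into typed objects...
report = HealthReport.model_validate({
    "report_date": "2024-03-01",
    "hospital_name": "City Lab",
    "results": [{"test_name": "Hemoglobin", "value": "13.5",
                 "unit": "g/dL", "is_normal": True}],
})
assert report.results[0].value == 13.5  # the string "13.5" was coerced to float

# ...while a malformed one raises a ValidationError we can react to.
try:
    HealthReport.model_validate({"report_date": "2024-03-01"})
except ValidationError as exc:
    print(f"rejected with {len(exc.errors())} errors")
```

Note how Pydantic coerces the string `"13.5"` into a `float` for us; this kind of silent cleanup is exactly what messy OCR output needs.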
Step 2: Extracting Text with Azure Form Recognizer
Before the LLM can "understand" the report, we need to extract the text. Azure AI Document Intelligence (formerly Form Recognizer) is fantastic at handling tables in scanned PDFs.
```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

def extract_raw_text(file_path: str) -> str:
    client = DocumentAnalysisClient(
        endpoint="YOUR_AZURE_ENDPOINT",
        credential=AzureKeyCredential("YOUR_KEY"),
    )
    with open(file_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()
    return result.content  # Returns the full text content
```
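`result.content` gives you reading-order text, but lab values usually live in tables, and the layout model also returns `result.tables`, whose cells expose `row_index`, `column_index`, and `content`. A small helper (hypothetical, not part of the Azure SDK) can flatten those cells into tab-separated lines, which LLMs tend to parse more reliably than prose-order text:

```python
def table_to_tsv(cells):
    """Flatten (row_index, column_index, content) triples into TSV lines.

    Mirrors the fields on Azure's DocumentTable cells, so you can call it as:
        table_to_tsv([(c.row_index, c.column_index, c.content) for c in table.cells])
    """
    rows = {}
    for row, col, content in cells:
        rows.setdefault(row, {})[col] = content
    lines = []
    for row in sorted(rows):
        cols = rows[row]
        lines.append("\t".join(cols.get(i, "") for i in range(max(cols) + 1)))
    return "\n".join(lines)

# Invented demo cells standing in for a real DocumentTable:
demo = [(0, 0, "Test"), (0, 1, "Result"),
        (1, 0, "Hemoglobin"), (1, 1, "13.5 g/dL")]
print(table_to_tsv(demo))
```

Appending these TSV blocks to the raw text before the LLM call keeps row/column relationships intact even when the scan's reading order is scrambled.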
Step 3: The Magic Hook (Instructor + LLM)
This is where the magic happens. Instead of using the raw OpenAI SDK, we use Instructor. It patches the OpenAI client so that it returns a Pydantic object directly.
```python
import instructor
from openai import OpenAI

# Patch the client to add 'response_model' support
client = instructor.patch(OpenAI())

def parse_report_with_llm(raw_text: str) -> HealthReport:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HealthReport,
        messages=[
            {"role": "system", "content": "You are a medical data specialist. Extract data into the specified JSON format. Map common names to LOINC standards where possible."},
            {"role": "user", "content": f"Extract data from this report: {raw_text}"},
        ],
        max_retries=3,  # If validation fails, it will automatically retry!
    )
```
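The `max_retries` loop only pays off if the model can actually *fail* validation, so it's worth adding domain checks to the schema: when a validator raises, Instructor feeds the error message back to the LLM on the next attempt. Here's a sketch with a stricter variant of the result model (the non-negative rule is an assumption; adjust it per test type):

```python
from pydantic import BaseModel, field_validator

class StrictTestResult(BaseModel):
    """Hypothetical stricter variant of MedicalTestResult."""
    test_name: str
    value: float
    unit: str

    @field_validator("value")
    @classmethod
    def value_must_be_non_negative(cls, v: float) -> float:
        # If this raises, Instructor re-prompts the LLM with the error text,
        # nudging it to re-read the report instead of hallucinating a number.
        if v < 0:
            raise ValueError("lab values cannot be negative; re-read the report")
        return v

ok = StrictTestResult(test_name="Glucose", value=5.4, unit="mmol/L")  # passes
```

The error string itself becomes part of the retry prompt, so write validators with messages that tell the model *what to do differently*.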
Level Up: Advanced Patterns
While this setup works for basic reports, production environments often require handling multi-page documents, redacting PII (Personally Identifiable Information), and mapping values to global standards like LOINC or SNOMED CT.
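To make the PII point concrete, here is a deliberately naive masking sketch. The regex patterns are illustrative only; a real pipeline should use a dedicated PII service (e.g., a cloud provider's PII detection API) rather than hand-rolled patterns:

```python
import re

# Illustrative patterns only -- real PII detection needs a dedicated service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace matched spans with a [LABEL] placeholder before LLM calls."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact: jane@example.com, 555-123-4567"))
```

Running the masking step *before* the LLM call means the raw identifiers never leave your infrastructure, which matters when the model is hosted by a third party.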
For a deeper dive into scaling these pipelines and implementing advanced medical data architectures, I highly recommend checking out the WellAlly Tech Blog. They have some incredible resources on high-performance data engineering and production-ready AI patterns that go far beyond a simple tutorial.
Why this matters
By structuring this data, we move from "Pictures of Documents" to "Actionable Insights."
- Trend Analysis: Plot your glucose levels over 5 years.
- Early Detection: Use algorithms to spot patterns across different hospitals.
- Portability: Easily share your data with new doctors without carrying a physical folder.
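For instance, once results are structured, the trend-analysis case is a few lines of plain Python. The sample rows below are invented; in practice they would come from your extracted `HealthReport` objects:

```python
from datetime import date

# Invented sample rows standing in for extracted results:
results = [
    {"date": date(2021, 5, 1), "test_name": "Glucose", "value": 5.2},
    {"date": date(2023, 5, 1), "test_name": "Glucose", "value": 5.9},
    {"date": date(2022, 5, 1), "test_name": "Glucose", "value": 5.6},
    {"date": date(2022, 5, 1), "test_name": "Hemoglobin", "value": 13.4},
]

# Filter one analyte and sort chronologically -- ready to plot or diff.
glucose_trend = sorted(
    (r for r in results if r["test_name"] == "Glucose"),
    key=lambda r: r["date"],
)
print([r["value"] for r in glucose_trend])
```

The same shape of query works across hospitals once test names are normalized to LOINC codes, which is precisely why the standardization step matters.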
Conclusion
Structuring messy medical data doesn't have to be a headache. With the Pydantic + Instructor stack, you get the reasoning power of an LLM with the strictness of a compiler.
What are you building next? Are you going to automate your lab results or perhaps build a custom health dashboard? Let me know in the comments below!
Happy coding! If you enjoyed this post, don't forget to like and share it!