We’ve all been there: staring at a blurry, scanned PDF of a medical lab report, trying to figure out if that "Glucose" level is actually within the normal range. In the world of Data Engineering, medical documents are the ultimate "black box." Unlike digital PDFs, scanned reports don't have text layers; they are just grids of pixels.
If you're building a health-tech app or a RAG (Retrieval-Augmented Generation) pipeline for medical records, you need more than just raw text. You need Automated Data Extraction and Document AI to turn those pixels into structured, actionable insights. In this tutorial, we are going to build a pipeline using LayoutParser, Tesseract OCR, and Streamlit to decode complex medical charts automatically.
## The Challenge: Why PyPDF2 Isn't Enough
Standard PDF libraries look for text streams. But scanned medical reports are images. To extract data reliably, we need to understand the visual structure—where the headers are, where the table rows sit, and which value belongs to which lab test.
## The Architecture
Here is how our intelligent extraction pipeline works:
```mermaid
graph TD
    A[Scanned Medical PDF] --> B[Image Preprocessing]
    B --> C{LayoutParser Analysis}
    C -->|Detect Table| D[Crop Table Region]
    C -->|Detect Header| E[Extract Patient Metadata]
    D --> F[Tesseract OCR Engine]
    F --> G[Pandas Data Cleaning]
    G --> H[Streamlit Dashboard & Tracking]
    H --> I[(Vector DB for RAG)]
```
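Before diving into the real implementations, the flow above can be sketched as a simple composition of stage functions. Everything below is a hypothetical placeholder (the function names and return values are illustrative stubs, not real APIs) just to show how the stages hand data to each other:

```python
# A minimal orchestration sketch of the pipeline diagram above.
# Every function here is a stub -- the real implementations are
# built step by step in the rest of this post.

def preprocess(pdf_path: str) -> list:
    """Convert scanned PDF pages into page images (stubbed)."""
    return [f"page-image-{i}" for i in range(2)]

def analyze_layout(page_image) -> dict:
    """Detect table and header regions on a page (stubbed)."""
    return {"tables": [page_image], "header": "patient-metadata"}

def run_ocr(region) -> str:
    """OCR a cropped table region (stubbed)."""
    return f"ocr-text({region})"

def run_pipeline(pdf_path: str) -> list:
    rows = []
    for page in preprocess(pdf_path):
        layout = analyze_layout(page)
        for table in layout["tables"]:
            rows.append(run_ocr(table))
    return rows

print(run_pipeline("report.pdf"))
```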
## Prerequisites
To follow along, you'll need the following tech stack:
- LayoutParser: For deep learning-based document layout analysis.
- Tesseract OCR: The engine that "reads" the characters.
- Pandas: For structuring our extracted data.
- Streamlit: To build a quick visualization interface.
```bash
pip install layoutparser torchvision pytesseract pandas streamlit
```

Note: there is no usable `tesseract` pip package — `pytesseract` is the Python wrapper, and the Tesseract binary itself must be installed via your system package manager (e.g., `apt-get install tesseract-ocr`). LayoutParser's Detectron2 backend also needs Detectron2 installed separately per its installation docs.
## Step 1: Intelligent Layout Detection
Instead of blindly OCR-ing the whole page (which leads to "alphabet soup"), we use LayoutParser to identify only the table areas. This significantly increases accuracy.
```python
import layoutparser as lp
import cv2

# Load the pre-trained PubLayNet model for layout detection
model = lp.Detectron2LayoutModel(
    'lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)

def detect_structure(image_path):
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Run detection
    layout = model.detect(image)
    # Keep only the table regions
    table_blocks = lp.Layout([b for b in layout if b.type == 'Table'])
    return table_blocks, image
```
## Step 2: Extracting Data with Tesseract OCR
Once we have the table coordinates, we crop that area and pass it to Tesseract. This ensures the OCR engine doesn't get confused by logos or footer text.
```python
ocr_agent = lp.TesseractAgent(languages='eng')

def extract_table_data(table_blocks, image):
    results = []
    for block in table_blocks:
        # Crop the image to the table area, with a small padding margin
        segment_image = block.pad(left=5, right=5, top=5, bottom=5).crop_image(image)
        # Perform OCR on the cropped region only
        res = ocr_agent.detect(segment_image)
        results.append(res)  # In a real app, you'd parse these into a list of lists
    return results
```
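The OCR output is just newline-separated text. As a rough sketch of the parsing step — the regex and the sample line format below are assumptions; real lab layouts vary wildly — you could split each line into test name, value, unit, range, and flag like this:

```python
import re

# Hypothetical pattern for lines like "Glucose 110 mg/dL 70-99 H".
# Real reports differ, so treat this as a starting point, not a spec.
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z ]+?)\s+"                  # test name
    r"(?P<value>\d+(?:\.\d+)?)\s+"                        # numeric result
    r"(?P<unit>\S+)\s+"                                   # unit, e.g. mg/dL
    r"(?P<range>[<>]?\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?)"   # reference range
    r"(?:\s+(?P<flag>[HL*]))?$"                           # optional abnormal flag
)

def parse_ocr_lines(text):
    """Turn raw OCR text into a list of row dicts, skipping noise lines."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows

sample = "Glucose 110 mg/dL 70-99 H\nHemoglobin 14.2 g/dL 13.5-17.5"
print(parse_ocr_lines(sample))
```

Lines that don't match the pattern (logos, page footers, stray OCR noise) are simply dropped, which is usually what you want for a first pass.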
## Step 3: Cleaning & Visualizing with Streamlit
Medical data is useless if it's messy. We use Pandas to clean up reference ranges (e.g., "3.5 - 5.0") and identify "Abnormal" flags (marked with H or *).
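Here's a minimal, dependency-free sketch of that cleaning logic. The range formats handled (`70-99`, `3.5 - 5.0`, `<200`, `>40`) are assumptions based on the examples in this post; real reports will need more cases:

```python
import re

def parse_range(ref):
    """Parse '70-99', '3.5 - 5.0', '<200', or '>40' into (low, high) bounds."""
    ref = ref.strip()
    m = re.match(r"^(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)$", ref)
    if m:
        return float(m.group(1)), float(m.group(2))
    m = re.match(r"^<\s*(\d+(?:\.\d+)?)$", ref)
    if m:
        return None, float(m.group(1))   # upper bound only
    m = re.match(r"^>\s*(\d+(?:\.\d+)?)$", ref)
    if m:
        return float(m.group(1)), None   # lower bound only
    return None, None                    # unrecognized format

def status(result, ref):
    """Flag a numeric result as High, Low, or Normal against its range."""
    low, high = parse_range(ref)
    if high is not None and result > high:
        return "High"
    if low is not None and result < low:
        return "Low"
    return "Normal"

print(status(110, "70-99"))       # Glucose above its range
print(status(14.2, "13.5-17.5"))  # Hemoglobin within range
```

This is how the "Status" column in the dashboard below could be derived instead of being hard-coded.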
```python
import streamlit as st
import pandas as pd

st.title("Medical Report Parser 🩺")

uploaded_file = st.file_uploader("Upload a scanned Lab Report (PNG/JPG)")

if uploaded_file:
    # ... (Processing Logic) ...
    df = pd.DataFrame({
        "Test Name": ["Glucose", "Hemoglobin", "Cholesterol"],
        "Result": [110, 14.2, 210],
        "Reference Range": ["70-99", "13.5-17.5", "<200"],
        "Status": ["High", "Normal", "High"]
    })
    st.table(df)

    # Visualization
    st.line_chart(df.set_index('Test Name')['Result'])
```
## The "Official" Way: Advanced Patterns
While the stack above is great for a weekend project, production-grade medical data extraction often requires handling multi-page forms, complex handwriting, and strictly validated schemas.
If you're looking to scale this into a full-scale Document AI pipeline or want to learn about integrating LLMs (like GPT-4o) to interpret these results contextually, you should definitely check out the deep-dives at WellAlly Blog. They cover advanced patterns for RAG-based medical systems and how to handle data privacy (HIPAA compliance) in automated workflows. It's an incredible resource for taking this prototype to the next level.
## Conclusion: Turning Data into Action
By combining LayoutParser's spatial awareness with Tesseract's character recognition, we've broken the "black box" of scanned PDFs. This pipeline isn't just about reading text; it's about structured data recovery.
Next Steps for you:

- Try adding a fuzzy-matching layer using `RapidFuzz` to standardize test names (e.g., "Gluc" -> "Glucose").
- Connect the output to a Vector Database like Pinecone for a medical RAG assistant.
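The fuzzy-matching idea can be prototyped with nothing but the standard library's `difflib` before you pull in RapidFuzz as a dependency. The canonical-name list and cutoff below are illustrative choices, not values from this pipeline:

```python
import difflib

# Hypothetical canonical test names you'd standardize against
CANONICAL = ["Glucose", "Hemoglobin", "Cholesterol", "Creatinine"]

def standardize(raw_name, cutoff=0.5):
    """Map a noisy OCR'd test name to its closest canonical name."""
    matches = difflib.get_close_matches(raw_name, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_name

print(standardize("Gluc"))       # fuzzy-matches "Glucose"
print(standardize("Hemoglbin"))  # survives a dropped character
```

RapidFuzz exposes the same idea with faster scorers and finer control, which matters once you're matching thousands of rows per report batch.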
What are you building with Document AI? Drop a comment below! 👇