We’ve all been there: a dusty folder (physical or digital) filled with a decade’s worth of health checkup reports. These PDFs are a nightmare of nested tables, inconsistent formatting, and blurry scans. When you want to know, "How has my fasting blood glucose trended since 2014?" or "Is this bilirubin level normal for me?", a simple keyword search just won't cut it.
In this tutorial, we are building a Personalized Medical Knowledge Base using a modern RAG (Retrieval-Augmented Generation) pipeline. We’ll tackle the "dirty data" problem in Health Data Engineering by leveraging Unstructured.io for complex PDF parsing and Qdrant as our high-performance Vector Database. By the end, you'll have a system capable of contextualized health queries that actually understand your medical history. 🏥
The Architecture: From Pixels to Insights
Handling medical data requires more than a simple text splitter: we need table extraction and OCR (Optical Character Recognition) so the data stays structured on its way into the vector store.
graph TD
A[Raw Medical PDFs/Images] --> B{Unstructured.io}
B -->|Table Extraction| C[Cleaned JSON/Text]
B -->|OCR Logic| C
C --> D[Sentence-Transformers]
D -->|Embeddings| E[(Qdrant Vector Store)]
F[User Query: 'My cholesterol trend?'] --> G[LangChain RAG Chain]
E -->|Context Retrieval| G
G --> H[LLM Response]
Prerequisites 🛠️
Before we dive in, make sure you have the following installed in your Python environment:
- Unstructured.io: For the heavy lifting of PDF partitioning.
- Qdrant Client: To manage our vector embeddings.
- LangChain: To glue the RAG components together.
- Sentence-Transformers: To generate local embeddings without hitting an API limit.
pip install "unstructured[pdf]" qdrant-client langchain langchain-community langchain-openai sentence-transformers
Step 1: Parsing the "Unstructured" Chaos 📄
Medical reports are notoriously table-heavy. Standard PDF loaders often turn these tables into a garbled mess of strings. Unstructured.io uses specialized models to detect document elements like titles, narrative text, and—most importantly—tables.
from unstructured.partition.pdf import partition_pdf
# Partitioning the PDF into structured elements
elements = partition_pdf(
filename="health_report_2023.pdf",
infer_table_structure=True, # Crucial for medical lab results
chunking_strategy="by_title", # Keeps related data together
max_characters=1500,
combine_text_under_n_chars=200,
)
# Extracting the text content
documents = [str(el) for el in elements]
print(f"Successfully extracted {len(documents)} context chunks!")
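One detail worth knowing: each Unstructured element exposes a `category` (e.g. `"Table"`, `"NarrativeText"`), so you can route lab-result tables into their own chunks instead of flattening everything with `str(el)`. Here's a minimal sketch of that idea; the `FakeElement` stand-ins are purely illustrative so the snippet runs without the library, and with real `partition_pdf` output you'd pass the elements in directly.

```python
def split_by_category(elements):
    """Separate table elements from narrative text so they can be
    chunked and embedded differently."""
    tables, narrative = [], []
    for el in elements:
        if getattr(el, "category", "") == "Table":
            tables.append(el)
        else:
            narrative.append(el)
    return tables, narrative

# Stand-in objects for demonstration only; real runs use Unstructured elements.
class FakeElement:
    def __init__(self, category, text):
        self.category, self.text = category, text
    def __str__(self):
        return self.text

elements = [
    FakeElement("Table", "Glucose | 92 mg/dL | Normal"),
    FakeElement("NarrativeText", "Fasting sample collected at 8am."),
]
tables, narrative = split_by_category(elements)
print(f"{len(tables)} table(s), {len(narrative)} narrative chunk(s)")
```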
Step 2: Vectorizing with Qdrant 🛰️
Once we have clean text chunks, we need to move them into Qdrant. Unlike simple flat files, Qdrant allows us to perform high-speed similarity searches, which is essential when querying thousands of data points across a decade of reports.
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings
# Using a lightweight, high-performance local embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Initialize Qdrant in-memory for testing, or use a URL for production
vector_db = Qdrant.from_texts(
documents,
embeddings,
location=":memory:", # Replace with 'http://localhost:6333' for persistence
collection_name="my_medical_history",
)
print("Vector database is ready for queries! 🧬")
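A practical refinement: `Qdrant.from_texts` also accepts a parallel `metadatas` list, and tagging each chunk with its report year is what makes the date-scoped queries in this tutorial reliable. Below is a small sketch of deriving that metadata from report filenames; the `year_from_filename` helper is a hypothetical convenience, not part of any library.

```python
import re

def year_from_filename(filename):
    """Pull a four-digit year (19xx/20xx) out of a report filename, if present."""
    m = re.search(r"(19|20)\d{2}", filename)
    return int(m.group(0)) if m else None

files = ["health_report_2018.pdf", "health_report_2023.pdf", "scan.pdf"]
metadatas = [{"source": f, "year": year_from_filename(f)} for f in files]
# These dicts would be passed as Qdrant.from_texts(..., metadatas=metadatas)
# and stored as payloads alongside each vector.
print(metadatas)
```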
Step 3: The "Brain" - Retrieval-Augmented Generation
Now, we connect the dots using LangChain. This allows the LLM to "read" your specific medical history before answering your questions.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# Set up the RAG chain
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model_name="gpt-4o", temperature=0),
chain_type="stuff",
retriever=vector_db.as_retriever()
)
query = "Compare my Vitamin D levels between 2018 and 2023. Am I improving?"
response = qa_chain.invoke({"query": query})
print(f"Result: {response['result']}")
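Under the hood, the `"stuff"` chain type simply concatenates the retrieved chunks into a single prompt. If you want tighter control over how the model is instructed (for example, forcing it to admit when a lab value is missing rather than guessing), you can assemble that prompt yourself. Here is a library-free sketch of the idea; the wording is an assumption, not LangChain's default template.

```python
def build_medical_prompt(context_chunks, question):
    """Assemble a RAG prompt that constrains answers to the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "You are a careful assistant reading personal lab reports.\n"
        "Answer ONLY from the context below. If a value is not present, "
        "say so explicitly.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_medical_prompt(
    ["2018: Vitamin D 22 ng/mL", "2023: Vitamin D 34 ng/mL"],
    "Compare my Vitamin D levels between 2018 and 2023.",
)
print(prompt)
```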
Taking it to Production: The "Official" Way 🥑
Building a local prototype is great, but when dealing with sensitive medical data or complex enterprise-grade RAG pipelines, you need more robust patterns.
For advanced strategies on securing medical data, handling multi-modal inputs (like X-rays), or optimizing vector search at scale, I highly recommend diving into the WellAlly Tech Blog. It's a fantastic resource for production-ready examples that go beyond the basics of LangChain.
Conclusion: Data-Driven Health 🩺
By combining Unstructured.io's powerful parsing with Qdrant's vector capabilities, we've turned a pile of useless PDFs into a dynamic, searchable knowledge base. This isn't just about saving time; it's about spotting long-term health trends that might otherwise go unnoticed.
What's next for your health bot?
- Add a frontend with Streamlit.
- Incorporate metadata filtering in Qdrant (e.g., filter by year).
- Implement a privacy layer to redact PII (Personally Identifiable Information).
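On the privacy-layer idea: a first pass can be as simple as regex masking of obvious identifiers before any text is embedded. This is a minimal sketch with illustrative patterns, not a substitute for proper clinical de-identification.

```python
import re

# Illustrative PII patterns; real deployments need far broader coverage
# (names, addresses, medical record numbers, dates of birth, etc.).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace matched PII with bracketed type labels before embedding."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com or 555-123-4567."))
```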
Happy coding! If you enjoyed this build, drop a comment below and let me know what data you're vectorizing next! 🚀💻