Have you ever stared at a stack of yellowing medical reports and thought, "I wish I could just ask my computer when my cholesterol started creeping up?"
We live in the era of the Quantified-Self, yet our most critical data—medical records—often sits rotting in "dirty" PDF scans or messy outpatient summaries. Today, we are going to fix that. We're building a Quantified-Self RAG (Retrieval-Augmented Generation) system designed to ingest a decade of personal health history using Unstructured.io, Sentence-Transformers, and Qdrant.
By the end of this guide, you'll have a pipeline capable of performing Hybrid Search (BM25 + Vector) to navigate through complex medical terminology and messy layouts. Let's turn those pixels into actionable health insights!
The Architecture: From Messy Scans to Structured Insights
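Medical PDFs are a nightmare. They contain tables, handwritten signatures, and inconsistent headers. Plain text extraction, such as PyPDF2's page.extract_text(), won't cut it: it flattens tables into loose strings, and a scanned page has no text layer to extract in the first place. We need a Layout-Aware approach.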
graph TD
A[Messy PDF Scans] --> B[Unstructured.io Partitioning]
B --> C[Layout-Aware Chunking]
C --> D{Hybrid Encoding}
D --> E[Dense Vector: Sentence-Transformers]
D --> F[Sparse Vector: BM25/SPLADE]
E --> G[Qdrant Vector Store]
F --> G[Qdrant Vector Store]
H[User Query] --> I[FastAPI Search Endpoint]
I --> G
G --> J[Contextual Answer]
🛠 Prerequisites
Before we dive into the code, ensure you have the following stack ready:
- Unstructured.io: For "smart" PDF parsing.
- Qdrant: Our high-performance vector database.
- Sentence-Transformers: To generate local embeddings.
- FastAPI: To serve our health-assistant API.
Step 1: Layout-Aware Ingestion with Unstructured.io
Standard parsers lose the context of tables. Unstructured.io treats the document as a series of elements (Title, NarrativeText, Table, etc.).
from unstructured.partition.pdf import partition_pdf

def extract_medical_data(file_path):
    # "hi_res" runs a layout-detection model to identify tables and headers
    elements = partition_pdf(
        filename=file_path,
        strategy="hi_res",
        infer_table_structure=True,  # keeps table HTML in metadata.text_as_html
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=800,
    )

    chunks = []
    for element in elements:
        metadata = element.metadata.to_dict()
        chunks.append({
            "text": element.text,
            # After chunking, this is typically 'CompositeElement' or 'Table'
            "type": element.category,
            "page": metadata.get("page_number"),
        })
    return chunks

# Example: Process a 2014 Blood Test Scan
# data_chunks = extract_medical_data("report_2014.pdf")
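Since the goal is a decade of history, each chunk is more useful if it records which report it came from. Here is a minimal sketch, assuming your files follow the `report_YYYY.pdf` naming used in the example above. The helper name `tag_chunks` and the `source` and `year` keys are illustrative choices, not part of Unstructured.io.

```python
import re
from pathlib import Path

def tag_chunks(chunks, file_path):
    """Attach the source file name and, if the name contains one, a year."""
    name = Path(file_path).name
    match = re.search(r"(19|20)\d{2}", name)
    year = int(match.group(0)) if match else None

    tagged = []
    for chunk in chunks:
        # Skip chunks with no usable text (blank regions, empty elements)
        if not (chunk.get("text") or "").strip():
            continue
        tagged.append({**chunk, "source": name, "year": year})
    return tagged

example = tag_chunks(
    [{"text": "LDL 130 mg/dL", "type": "Table", "page": 1}],
    "scans/report_2014.pdf",
)
```

With a `year` field in the payload, a search can later be narrowed to a date range using Qdrant's payload filters.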
Step 2: Setting up Qdrant for Hybrid Search
Medical queries often require exact keyword matches (e.g., "HbA1c") and semantic meaning (e.g., "blood sugar levels"). Qdrant's Hybrid Search combines the best of both worlds.
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(":memory:")  # Or your cloud/docker instance

# Start from a clean collection (recreate_collection is deprecated in recent clients)
if client.collection_exists("medical_records"):
    client.delete_collection("medical_records")

# Create a collection with both Dense and Sparse vectors
client.create_collection(
    collection_name="medical_records",
    vectors_config=models.VectorParams(
        size=384,  # For 'all-MiniLM-L6-v2'
        distance=models.Distance.COSINE,
    ),
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(
                on_disk=True,
            )
        )
    },
)
Step 3: Generating Embeddings & Upserting
We’ll use Sentence-Transformers for the dense embeddings. For the sparse part, we can use a simple BM25-like approach or Qdrant’s built-in sparse capabilities.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def prepare_points(chunks, start_id=0):
    # Pass a different start_id per file: reusing IDs would overwrite earlier reports
    points = []
    for i, chunk in enumerate(chunks):
        vector = model.encode(chunk["text"]).tolist()  # dense embedding only
        points.append(
            models.PointStruct(
                id=start_id + i,
                vector=vector,
                payload=chunk,
            )
        )
    return points

# client.upsert(collection_name="medical_records", points=prepare_points(data_chunks))
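The upsert above stores only the dense embedding, so the "text-sparse" vector in the collection stays empty. Here is a minimal sketch of the "BM25-like" side, assuming a simple hashed term-frequency encoding. The function name `sparse_encode` and the tokenizer are illustrative. This is not full BM25, since it has no IDF or length normalization; recent Qdrant versions can apply IDF weighting on the server if the sparse vector is configured with `modifier=models.Modifier.IDF`.

```python
import re
import zlib
from collections import Counter

def sparse_encode(text):
    """Return (indices, values): hashed token ids and their term frequencies."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    weights = Counter()
    for token, count in Counter(tokens).items():
        # crc32 is stable across runs (unlike built-in hash()) and fits in uint32
        weights[zlib.crc32(token.encode("utf-8"))] += float(count)
    indices = sorted(weights)
    values = [weights[i] for i in indices]
    return indices, values

indices, values = sparse_encode("HbA1c HbA1c glucose")
```

The two lists map onto `models.SparseVector(indices=indices, values=values)`, which would be stored under the "text-sparse" name next to the dense vector. Queries must be encoded with the same function so that a term like "HbA1c" lands on the same index.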
Pro-Tip: Advanced Patterns & Production Safety
Building a medical RAG isn't just about indexing; it's about accuracy and privacy. If you are looking for production-ready patterns, such as Self-Querying Retrievers (filtering by year/doctor automatically) or Advanced Re-ranking for medical accuracy, I highly recommend exploring the resources at WellAlly Blog. They have fantastic deep dives into scaling LLM applications for sensitive data.
Step 4: Building the FastAPI Query Interface
Now, let's wrap this in a clean API to query our decade of data.
from fastapi import FastAPI

app = FastAPI()

# `model` and `client` are the objects created in Steps 2 and 3.
# A plain `def` lets FastAPI run the blocking encode/search calls in a thread pool.
@app.get("/query")
def ask_health_history(q: str):
    # 1. Embed the query
    query_vector = model.encode(q).tolist()

    # 2. Dense-vector search in Qdrant (the sparse vector is not queried here)
    search_result = client.query_points(
        collection_name="medical_records",
        query=query_vector,
        limit=3,
        with_payload=True,
    ).points

    # 3. Format the context for the LLM
    context = "\n".join(res.payload["text"] for res in search_result)

    return {
        "query": q,
        "context_found": context,
        "sources": [res.payload["page"] for res in search_result],
    }

# Run with: uvicorn main:app --reload
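To make the search genuinely hybrid, the dense and sparse result lists have to be merged into one ranking. A common way to do that is Reciprocal Rank Fusion (RRF). Here is a minimal client-side sketch, assuming you already have two ranked lists of point IDs; the function name, `k=60`, and the sample IDs are illustrative. Qdrant's Query API can also perform this fusion on the server through `prefetch` combined with `models.FusionQuery(fusion=models.Fusion.RRF)`.

```python
def reciprocal_rank_fusion(rankings, k=60, limit=3):
    """Fuse ranked lists of ids; ids ranked highly in several lists score higher."""
    scores = {}
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [point_id for point_id, _ in fused[:limit]]

dense_hits = [7, 2, 9]   # hypothetical ids from the dense (semantic) search
sparse_hits = [2, 4, 7]  # hypothetical ids from the sparse (keyword) search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == [2, 7, 4]: ids found by both searches rise to the top
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and keyword scores live on different scales.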
Why This Matters
By using Layout-aware OCR, we ensure that a value in a "Cholesterol" table row isn't just a random number—it's tied to its header. By using Hybrid Search, we ensure that searching for "high sugar" finds "Hyperglycemia" (Semantic) while searching for "Tylenol" finds exactly "Tylenol" (Keyword).
Personal health data is the ultimate frontier for RAG. You've now built a system that doesn't just store data—it remembers your history.
What's next?
- Privacy: Use a local LLM (like Llama 3 via Ollama) to keep your medical data off the cloud.
- Visualization: Connect this to a Streamlit dashboard to graph your lab results over time.
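For the privacy route, the retrieved context still has to be turned into a prompt for the local model. Here is a minimal sketch, assuming an Ollama server running on your machine with a `llama3` model pulled. The prompt wording and both function names are illustrative, and the request is only built here, not sent.

```python
def build_prompt(question, contexts):
    """Number each retrieved excerpt so the answer can point back to its source."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the medical record excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Excerpts:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

def build_ollama_request(question, contexts, model_name="llama3"):
    """JSON body for a POST to a local Ollama server's /api/generate endpoint."""
    return {
        "model": model_name,
        "prompt": build_prompt(question, contexts),
        "stream": False,  # ask for one complete response instead of a token stream
    }

request_body = build_ollama_request(
    "When did my LDL start rising?",
    ["LDL 130 mg/dL (2016)", "LDL 155 mg/dL (2019)"],
)
```

Telling the model to rely only on the excerpts, and to say when they fall short, is a cheap guard against invented lab values.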
Join the Conversation!
Are you working on Quantified-Self projects? What’s your biggest struggle with messy PDFs? Let’s chat in the comments below! 👇
If you enjoyed this tutorial, don't forget to check out WellAlly for more high-level architectural insights!