This document explains how medical data is ingested and searched using Qdrant with hybrid retrieval.
1. Responsibilities
- Load CSV medical Q&A
- Embed text using Sentence Transformers
- Store vectors + payload in Qdrant
- Create full-text payload indexes for keyword matching (the BM25 side of the search)
- Perform hybrid (vector + keyword) fusion with RRF
2. Ingestion Entry Point
# scripts/ingest.py
def ingest_data(data_path: str = "data/medical_qa_full.csv") -> MedicalVectorStore:
    print(f"Starting medical data ingestion from: {data_path}")
    _global_vector_store = MedicalVectorStore(
        host="localhost", port=6333, collection_name="medical_knowledge"
    )
    return _global_vector_store
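The excerpt above does not show how the CSV itself is read. A minimal sketch of that step, using only the standard library, might look like the following. The column names (`question`, `answer`, `medical_department`) and the helper `load_qa_rows` are assumptions for illustration, not taken from the project.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def load_qa_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, skipping rows with an empty question."""
    reader = csv.DictReader(lines)
    for row in reader:
        # Strip whitespace and treat missing cells as empty strings.
        cleaned = {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        if cleaned.get("question"):  # "question" is an assumed column name
            yield cleaned


# Tiny in-memory stand-in for data/medical_qa_full.csv (columns are assumed).
sample_csv = io.StringIO(
    "question,answer,medical_department\n"
    "What causes migraines?,Triggers include stress and poor sleep.,Neurology\n"
    ",Orphan answer without a question.,General\n"
)
rows = list(load_qa_rows(sample_csv))
```

With a real file, the same function could be fed an open file handle, for example `open(data_path, newline="", encoding="utf-8")`.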
3. Vector DB Class
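# src/database/vector_db.py
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer


class MedicalVectorDB:
    def __init__(self, host: str = "localhost", port: int = 6333):
        # Connect to the Qdrant server
        self.client = QdrantClient(host=host, port=port)
        # multi-qa-MiniLM-L6-cos-v1 produces 384-dimensional embeddings
        self.model = SentenceTransformer(
            "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
        )
        self.collection_name = "medical_knowledge"
        self.vector_size = 384  # must match the model's output dimension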
4. Collection Creation
# src/database/vector_db.py
from qdrant_client.models import Distance, VectorParams

self.client.create_collection(
    collection_name=self.collection_name,
    vectors_config=VectorParams(size=self.vector_size, distance=Distance.COSINE),
)
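To make the `Distance.COSINE` setting concrete, here is a small pure-Python sketch of cosine similarity. It is only an illustration of the metric, not how Qdrant computes it internally, and the helper name `cosine_similarity` is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Because the score depends only on direction, two embeddings that point the same way score the same regardless of their length.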
BM25 Text Indexes
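# src/database/vector_db.py
from qdrant_client.models import TextIndexParams, TextIndexType, TokenizerType

# `field` is the name of the payload field being indexed
self.client.create_payload_index(
    collection_name=self.collection_name,
    field_name=field,
    field_schema=TextIndexParams(
        type=TextIndexType.TEXT,
        tokenizer=TokenizerType.WORD,
        min_token_len=2,   # ignore one-character tokens
        max_token_len=20,
        lowercase=True,    # case-insensitive matching
    ),
)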
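The index parameters above control which tokens end up searchable. The sketch below approximates their effect in plain Python. It is not Qdrant's tokenizer, and the regular expression used for word splitting is an assumption.

```python
import re
from typing import List


def approximate_word_tokens(
    text: str,
    min_token_len: int = 2,
    max_token_len: int = 20,
    lowercase: bool = True,
) -> List[str]:
    """Roughly mimic a word tokenizer with length bounds and lowercasing."""
    if lowercase:
        text = text.lower()
    # Split on anything that is not a letter, digit, or underscore.
    tokens = re.findall(r"\w+", text)
    # Keep only tokens whose length falls inside the configured bounds.
    return [t for t in tokens if min_token_len <= len(t) <= max_token_len]


tokens = approximate_word_tokens("A fever of 39 C")
```

Here the one-character tokens `a` and `c` are dropped, which is the practical consequence of `min_token_len=2`.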
5. Preparing Documents
# src/database/vector_db.py
question, combined_text, full_context = self._prepare_document_text(doc)
vector = self.model.encode(combined_text).tolist()
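The body of `_prepare_document_text` is not shown. One plausible shape, given that it returns three strings, is sketched below. The payload keys and the way the strings are joined are assumptions for illustration only.

```python
from typing import Dict, Tuple


def prepare_document_text(doc: Dict[str, str]) -> Tuple[str, str, str]:
    """Return (question, combined_text, full_context) for one Q&A record."""
    # All three keys are assumed column names, not confirmed by the source.
    question = doc.get("question", "").strip()
    answer = doc.get("answer", "").strip()
    department = doc.get("medical_department", "").strip()

    # Text that gets embedded: question and answer together.
    combined_text = f"{question} {answer}".strip()

    # Longer context string kept in the payload; empty parts are skipped.
    context_parts = [part for part in (department, question, answer) if part]
    full_context = " | ".join(context_parts)
    return question, combined_text, full_context


example = prepare_document_text(
    {
        "question": "What causes migraines?",
        "answer": "Stress and poor sleep.",
        "medical_department": "Neurology",
    }
)
```

Embedding the question together with the answer lets a query match on wording that appears in either part.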
6. Upserting Points
# src/database/vector_db.py
points.append(PointStruct(id=doc_id, vector=vector, payload=payload))
self.client.upsert(collection_name=self.collection_name, points=points)
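For a large CSV, sending every point in a single `upsert` call can produce a very large request. A common pattern is to upsert in fixed-size batches. The helper below is a hypothetical sketch of that batching step, and the batch size of 256 mentioned in the comment is an arbitrary assumption.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Intended use with the excerpt above (not executed here):
#   for batch in batched(points, 256):
#       self.client.upsert(collection_name=self.collection_name, points=batch)
batches = list(batched(list(range(5)), 2))
```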
7. Hybrid Search Overview
Steps:
- Encode query → vector search
- Build BM25 filter → scroll
- Apply Reciprocal Rank Fusion (RRF)
- Enhance scores with domain rules
- Return top_k results
# src/database/vector_db.py
for rank, hit in enumerate(vector_results):
    doc_id = hit.id  # assumed: the Qdrant point ID identifies the document
    rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1.0 / (k + rank + 1))
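The loop above handles one result list. A self-contained sketch of the full fusion over both the vector and keyword lists could look like this. The function name and the default `k = 60` are assumptions, since the value of `k` is not given here.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked ID lists; each hit adds 1 / (k + rank + 1)."""
    scores: Dict[Hashable, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):  # rank starts at 0
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ids = ["a", "b", "c"]   # order returned by vector search
keyword_ids = ["b", "d"]       # order returned by keyword search
fused = reciprocal_rank_fusion([vector_ids, keyword_ids])
fused_order = [doc_id for doc_id, _ in fused]
```

Document `b` appears in both lists, so its two contributions add up and it ranks first even though it was second in the vector results.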
Domain Scoring Enhancement
# src/database/vector_db.py
if any(word in doc.get("medical_department", "").lower() for word in query_words):
    total_score *= 1.15
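The rule above can be wrapped in a small function to make its behavior easier to test. This is a sketch under the assumption that `query_words` is the lowercased, whitespace-split query. The function name is hypothetical.

```python
from typing import Dict


def apply_department_boost(
    base_score: float, doc: Dict[str, str], query: str, boost: float = 1.15
) -> float:
    """Multiply the score by `boost` if a query word occurs in the department."""
    query_words = query.lower().split()  # assumed tokenization of the query
    department = doc.get("medical_department", "").lower()
    # Substring test, as in the excerpt: "cardio" matches "cardiology".
    if any(word in department for word in query_words):
        return base_score * boost
    return base_score


boosted = apply_department_boost(
    1.0, {"medical_department": "Cardiology"}, "cardiology chest pain"
)
unchanged = apply_department_boost(
    1.0, {"medical_department": "Neurology"}, "chest pain"
)
```

Because the check is a substring match, very short query words can trigger the boost by accident. Filtering out short words or stop words first is one way to tighten it.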
8. Why Hybrid?
- Vector = semantic similarity
- BM25 = exact token recall
- RRF combines strengths
- Domain rules boost results whose medical fields (e.g. medical_department) match query terms
9. Potential Extensions
- Add freshness scoring
- Add field-level boosting configuration in config file
- Cache embeddings locally
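As an illustration of the last extension, here is a minimal in-memory embedding cache keyed by a hash of the text. The class name, the `misses` counter, and the stand-in `fake_embed` function are all hypothetical. In practice the wrapped function would call the real model, and the store could be persisted to disk.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings by SHA-256 of the input text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the embedder was actually called

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real model: returns [length, number of spaces].
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("chest pain")
second = cache.get("chest pain")  # served from the cache
```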