Abdelrahman Adnan

# Data Ingestion & Vector Store #llmszoomcamp

This document explains how medical data is ingested and searched using Qdrant with hybrid retrieval.

1. Responsibilities

  • Load CSV medical Q&A
  • Embed text using Sentence Transformers
  • Store vectors + payload in Qdrant
  • Create full-text payload indexes for BM25-style keyword matching
  • Perform hybrid (vector + keyword) fusion with RRF

2. Ingestion Entry Point

# scripts/ingest.py
def ingest_data(data_path: str = "data/medical_qa_full.csv") -> MedicalVectorStore:
    print(f"Starting medical data ingestion from: {data_path}")
    _global_vector_store = MedicalVectorStore(
        host="localhost", port=6333, collection_name="medical_knowledge"
    )
    return _global_vector_store
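
The entry point above only constructs the store; the CSV loading step itself is not shown. A minimal sketch of that step might look like the following. The column names `question` and `answer` and the helper name `parse_medical_qa` are assumptions for illustration, not taken from the project.

```python
import csv
import io
from typing import TextIO


def parse_medical_qa(handle: TextIO) -> list[dict]:
    """Read Q&A rows from an open CSV file, skipping rows with no question."""
    documents = []
    for row in csv.DictReader(handle):
        # "question" and "answer" are assumed column names.
        question = (row.get("question") or "").strip()
        if not question:
            continue
        documents.append(
            {"question": question, "answer": (row.get("answer") or "").strip()}
        )
    return documents


# In-memory demo; for a real file use
# open(data_path, newline="", encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    "question,answer\n"
    "What is asthma?,A chronic lung condition.\n"
    ",Answer without a question\n"
)
docs = parse_medical_qa(sample)
```

Dropping rows without a question early keeps empty strings out of the embedding step.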

3. Vector DB Class

# src/database/vector_db.py
class MedicalVectorDB:
    def __init__(self, host: str = "localhost", port: int = 6333):
        self.client = QdrantClient(host=host, port=port)
        self.model = SentenceTransformer(
            "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
        )
        self.collection_name = "medical_knowledge"
        self.vector_size = 384

4. Collection Creation

# src/database/vector_db.py
self.client.create_collection(
    collection_name=self.collection_name,
    vectors_config=VectorParams(size=self.vector_size, distance=Distance.COSINE),
)

BM25 Text Indexes

# src/database/vector_db.py
self.client.create_payload_index(
    collection_name=self.collection_name,
    field_name=field,
    field_schema=TextIndexParams(
        type=TextIndexType.TEXT,
        tokenizer=TokenizerType.WORD,
        min_token_len=2,
        max_token_len=20,
        lowercase=True,
    ),
)
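
To get a feel for what these index parameters do to a field's text, here is a rough pure-Python approximation. It is not Qdrant's actual tokenizer; it assumes word tokenization behaves roughly like splitting on non-word characters and that tokens outside the length bounds are not indexed. The function name `approximate_index_tokens` is hypothetical.

```python
import re


def approximate_index_tokens(
    text: str,
    min_token_len: int = 2,
    max_token_len: int = 20,
    lowercase: bool = True,
) -> list[str]:
    """Approximate which tokens a word-tokenized text index would keep."""
    if lowercase:
        text = text.lower()
    # Assumption: word tokenization is close to splitting on non-word characters.
    tokens = re.findall(r"\w+", text)
    # Tokens shorter than min_token_len or longer than max_token_len are dropped.
    return [t for t in tokens if min_token_len <= len(t) <= max_token_len]


tokens = approximate_index_tokens("What is a Type-2 Diabetes?")
```

With the settings above, single-character tokens such as "a" and "2" fall below `min_token_len` and would not be searchable, which matters for queries like "type 2 diabetes".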

5. Preparing Documents

# src/database/vector_db.py
question, combined_text, full_context = self._prepare_document_text(doc)
vector = self.model.encode(combined_text).tolist()
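
The body of `_prepare_document_text` is not shown in the excerpt. As a hypothetical stand-in, the sketch below returns the same three values from a row dict. The `question` and `answer` keys and the exact text layout are assumptions; only `medical_department` appears elsewhere in the project code.

```python
def prepare_document_text(doc: dict) -> tuple[str, str, str]:
    """Hypothetical stand-in: build (question, combined_text, full_context)."""
    question = str(doc.get("question") or "").strip()
    answer = str(doc.get("answer") or "").strip()
    department = str(doc.get("medical_department") or "").strip()

    # Text that gets embedded: question and answer together.
    combined_text = f"{question} {answer}".strip()

    # Human-readable context stored in the payload.
    parts = [f"Question: {question}", f"Answer: {answer}"]
    if department:
        parts.append(f"Department: {department}")
    full_context = "\n".join(parts)

    return question, combined_text, full_context


question, combined_text, full_context = prepare_document_text(
    {
        "question": " What is asthma? ",
        "answer": "A chronic lung condition.",
        "medical_department": "Pulmonology",
    }
)
```

Embedding the question together with the answer is one possible choice; embedding only the question would favor query-to-question similarity instead.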

6. Upserting Points

# src/database/vector_db.py
points.append(PointStruct(id=doc_id, vector=vector, payload=payload))
self.client.upsert(collection_name=self.collection_name, points=points)
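
Two practical details around upserting are worth sketching. Qdrant point IDs must be unsigned integers or UUIDs, and large datasets are usually sent in batches rather than in one request. The helpers below are illustrative assumptions (`stable_point_id`, `batched`, and the batch size of 100 are not from the project).

```python
import uuid
from typing import Iterator, Sequence


def stable_point_id(question: str) -> str:
    """Derive a deterministic UUID so re-ingesting a row overwrites the same point."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, question))


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Demo with plain integers standing in for PointStruct objects.
batch_sizes = [len(batch) for batch in batched(list(range(250)), 100)]

# With a real client the loop would look like:
# for batch in batched(points, 100):
#     client.upsert(collection_name="medical_knowledge", points=list(batch))
```

Deterministic IDs make ingestion idempotent: running the script twice updates existing points instead of duplicating them.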

7. Hybrid Search Overview

Steps:

  1. Encode query → vector search
  2. Build a keyword (text-match) filter → fetch matching points with scroll
  3. Apply Reciprocal Rank Fusion (RRF)
  4. Enhance scores with domain rules
  5. Return top_k results
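
The RRF accumulation for the vector results looks like this:

# src/database/vector_db.py
# doc_id identifies the current hit (e.g. its point ID); k is the RRF smoothing constant
for rank, hit in enumerate(vector_results):
    rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1.0 / (k + rank + 1))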
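
The excerpt shows the accumulation for one result list only. A self-contained sketch of the full fusion over both lists could look like this. The function name, the use of plain ID lists, and the default `k=60` (a commonly used value) are assumptions, not the project's settings.

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60, top_k: int = 5
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank + 1)."""
    rrf_scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):  # rank starts at 0
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    ordered = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_k]


vector_ids = ["a", "b", "c"]   # order returned by the vector search
keyword_ids = ["b", "d"]       # order returned by the keyword lookup
fused = reciprocal_rank_fusion([vector_ids, keyword_ids])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here "b" ranks first because it appears in both lists, even though it was not the top vector hit. Note that RRF depends on the order of each list, so the keyword results need a meaningful order for their ranks to carry information.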

Domain Scoring Enhancement

# src/database/vector_db.py
if any(word in doc.get("medical_department", "").lower() for word in query_words):
    total_score *= 1.15
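
As a standalone illustration of this rule, the sketch below wraps the same check in a function. How `query_words` is built is not shown in the excerpt, so the whitespace split and the `min_word_len` filter are assumptions; the filter is there because a substring test with very short words (such as "a") would match almost any department name.

```python
def apply_department_boost(
    score: float,
    doc: dict,
    query: str,
    boost: float = 1.15,
    min_word_len: int = 3,  # assumption: ignore very short query words
) -> float:
    """Multiply the score by `boost` if a query word occurs in the department name."""
    query_words = [w for w in query.lower().split() if len(w) >= min_word_len]
    department = (doc.get("medical_department") or "").lower()
    if any(word in department for word in query_words):
        return score * boost
    return score


boosted = apply_department_boost(
    1.0, {"medical_department": "Cardiology"}, "cardiology chest pain"
)
unchanged = apply_department_boost(
    1.0, {"medical_department": "Dermatology"}, "is a chest pain serious"
)
```

Because the boost is multiplicative, it reorders results with similar fused scores without letting a department match override a much stronger retrieval signal.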

8. Why Hybrid?

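  • Vector search captures semantic similarity, so paraphrased questions still match
  • BM25-style keyword matching recovers exact tokens such as drug or condition names
  • RRF merges both rankings using only rank positions, so the two score scales never need to be normalized
  • Domain rules boost results whose medical fields (for example medical_department) match the query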

9. Potential Extensions

  • Add freshness scoring
  • Add field-level boosting configuration in config file
  • Cache embeddings locally
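
As one example, the embedding cache could start as a small wrapper around the encoder. This is a minimal in-memory sketch under stated assumptions: the class name `EmbeddingCache` is hypothetical, and persisting the store to disk (for instance as JSON or SQLite) is left out.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by a hash of the input text."""

    def __init__(self, encode_fn: Callable[[str], list[float]]):
        self._encode_fn = encode_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the encoder was actually called

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode_fn(text)
        return self._store[key]


# Demo with a fake encoder; with the real model this could be
# EmbeddingCache(lambda text: model.encode(text).tolist()).
cache = EmbeddingCache(lambda text: [float(len(text))])
first = cache.get("fever")
second = cache.get("fever")
```

The second lookup returns the stored vector without calling the encoder again, which helps when the same rows are re-ingested.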
