Abdelrahman Adnan
# Medical RAG Architecture Overview #llmszoomcamp

This document provides a comprehensive explanation of the Retrieval-Augmented Generation (RAG) system architecture, breaking down each component and showing how they work together to deliver accurate medical information.

## 1. What is RAG?

RAG combines the power of:

  • Information Retrieval: Finding relevant documents from a knowledge base
  • Language Generation: Using LLMs to synthesize coherent answers
  • Context Grounding: Ensuring answers are based on retrieved evidence

## 2. High-Level Architecture Flow

```
[User Question]
    ↓
[Hybrid Search: Vector + BM25]
    ↓
[Context Assembly & Prompt Building]
    ↓
[LLM Generation (GPT-4o-mini/GPT-4o)]
    ↓
[Answer Evaluation & Quality Assessment]
    ↓
[Metrics Calculation & Response Packaging]
```

## 3. Detailed Processing Pipeline

### Step 1: Query Processing

  • Clean and normalize the input medical question
  • Prepare query for both semantic and lexical search
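
The preprocessing code itself isn't shown in this post; as a rough illustration, a minimal sketch might look like the following (the function name and cleanup rules are assumptions, not the project's actual implementation):

```python
import re

def normalize_query(raw: str) -> str:
    """Hypothetical cleanup: trim, collapse whitespace, and drop noisy
    symbols while keeping clinically meaningful tokens like "5 mg/kg"."""
    text = raw.strip()
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return re.sub(r"[^\w\s.,%/()?-]", "", text)  # keep dose/unit punctuation
```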

### Step 2: Hybrid Retrieval

  • Vector Search: Semantic similarity using 384-dimensional embeddings (see the sketch after this list)
  • BM25 Search: Keyword-based exact matching
  • RRF Fusion: Combines both approaches using Reciprocal Rank Fusion
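
For reference, 384 dimensions matches compact sentence-embedding models such as all-MiniLM-L6-v2. Below is a minimal sketch of the semantic side under that assumption; the real system queries a vector database rather than an in-memory matrix:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

def semantic_top_k(query: str, doc_vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank documents by cosine similarity to the query embedding."""
    q = model.encode(query, normalize_embeddings=True)
    scores = doc_vectors @ q        # assumes rows of doc_vectors are L2-normalized
    return np.argsort(-scores)[:k]  # indices of the k most similar documents
```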

### Step 3: Context Assembly

  • Select top-k most relevant medical cases
  • Format retrieved documents into structured context
  • Apply medical domain-specific scoring enhancements

### Step 4: Answer Generation

  • Build specialized medical prompt with retrieved context
  • Generate response using OpenAI models with controlled parameters
  • Apply medical safety guidelines

### Step 5: Quality Assurance

  • Evaluate answer relevance using LLM-as-a-judge
  • Calculate confidence scores and metadata
  • Track performance metrics and costs

## 4. Core System Components

| Component | File Location | Primary Responsibility |
|---|---|---|
| RAG Orchestrator | src/core/rag.py | Main pipeline coordination |
| Vector Database | src/database/vector_db.py | Hybrid search + RRF fusion |
| Data Ingestion | scripts/ingest.py | Document processing & indexing |
| API Layer | src/api/main_api.py | REST endpoints & async processing |
| Web Interface | src/api/web_interface.py | Interactive Streamlit UI |
| Monitoring | src/services/s3_service.py | Logging & metrics collection |

## 5. Advanced Search Mechanism

### Hybrid Search Strategy

Our system implements a sophisticated hybrid approach that combines:

  1. Semantic Vector Search (Cosine Similarity)

```python
# src/core/rag.py
def search(query: str, top_k: int = 5) -> List[Dict]:
    """Search medical knowledge base using hybrid search"""
    return hybrid_query_rrf(query, top_k=top_k)
```
  2. BM25 Keyword Search (Exact Token Matching)
    • Handles medical terminology and acronyms
    • Captures exact drug names and dosages
    • Preserves clinical precision
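
As a companion to the semantic sketch above, here is a minimal BM25 example using the rank_bm25 package (an assumption; the project may use Elasticsearch or another BM25 implementation):

```python
from rank_bm25 import BM25Okapi

# Toy corpus standing in for the medical knowledge base
documents = [
    "aspirin 81 mg daily for cardiovascular prophylaxis",
    "metformin is first-line therapy for type 2 diabetes",
]
bm25 = BM25Okapi([doc.lower().split() for doc in documents])

def bm25_top_k(query: str, k: int = 5) -> list:
    """Return indices of the k best keyword matches."""
    scores = bm25.get_scores(query.lower().split())
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```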

### Reciprocal Rank Fusion (RRF) Algorithm

RRF combines multiple ranking approaches using the formula:

```
RRF_score = Σ(1 / (k + rank_i))
```

where k = 60 is a smoothing constant that damps the influence of lower-ranked documents, and rank_i is the document's position in the i-th ranking list.
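
The fusion itself is only a few lines of code. This sketch implements the formula directly; presumably hybrid_query_rrf wraps something similar inside the vector database layer:

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: rrf_fuse([vector_ranking, bm25_ranking])[:5]
```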

### Medical Domain Scoring Enhancements

  • Severity Weighting: Life-threatening conditions get priority
  • Department Relevance: Matches medical specialties
  • Symptom Alignment: Boosts exact symptom matches
  • Treatment Precision: Enhances therapeutic recommendations
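
The post doesn't show the boost logic itself, so the sketch below only illustrates the idea; the field names ("severity", "department") and boost values are assumptions about the document schema:

```python
SEVERITY_BOOST = {"life_threatening": 0.30, "urgent": 0.15, "routine": 0.0}

def apply_medical_boosts(doc: dict, base_score: float, query_department: str) -> float:
    """Re-score a retrieved case with domain-specific boosts (illustrative values)."""
    score = base_score + SEVERITY_BOOST.get(doc.get("severity", "routine"), 0.0)
    if doc.get("department") == query_department:
        score += 0.10  # department relevance bonus
    return score
```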

## 6. Medical Prompt Engineering

### Structured Prompt Architecture

Our prompts are carefully designed for medical accuracy and safety:

### System Instruction Design

  • Role Definition: "You are a knowledgeable medical assistant"
  • Evidence Constraint: "Answer based solely on provided CONTEXT"
  • Factual Grounding: "Use only facts from the CONTEXT"

### Context Formatting Strategy

Each retrieved medical case follows a structured template:

```python
# src/core/rag.py
PROMPT_TEMPLATE = """You are a knowledgeable medical assistant. Answer the QUESTION based solely on the information provided in the CONTEXT from the medical database.

Use only the facts from the CONTEXT when formulating your answer.

QUESTION: {question}

CONTEXT:
{context}""".strip()

ENTRY_TEMPLATE = """Medical Case:
Question: {question}
Answer: {answer}
Relevance Score: {score:.3f}""".strip()
```
```python
# src/core/rag.py
def build_prompt(query: str, search_results: List[Dict]) -> str:
    context = ""
    for doc in search_results:
        context += (
            ENTRY_TEMPLATE.format(
                question=doc.get("question", "N/A"),
                answer=doc.get("answer", "N/A"),
                score=doc.get("score", 0.0),
            ) + "\n\n"
        )
    return PROMPT_TEMPLATE.format(question=query, context=context.strip())
```

## 7. Language Model Integration

### Model Selection Strategy

```python
# src/core/rag.py
def llm(prompt: str, model: str = "gpt-4o-mini") -> Tuple[str, Dict]:
    """Generate response using OpenAI LLM with medical-optimized parameters"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,        # Sufficient for comprehensive medical answers
        temperature=0.1,        # Low temperature for consistency and accuracy
    )
    # Token accounting and the return statement are shown under "Response Processing" below
```

### Parameter Optimization for Medical Use

  • Temperature (0.1): Produces near-deterministic, conservative responses
  • Max Tokens (1000): Balances comprehensiveness with cost
  • Model Choice: GPT-4o-mini provides 91.11% relevance vs GPT-4o's 64.75%

### Cost-Performance Analysis

| Model | Relevance Rate | Cost per 1K input tokens | Use Case |
|---|---|---|---|
| GPT-4o-mini | 91.11% | $0.00015 | Primary model |
| GPT-4o | 64.75% | $0.03 | Complex cases only |

### Response Processing

```python
# src/core/rag.py (continuation of llm)
answer = response.choices[0].message.content
token_stats = {
    "prompt_tokens": response.usage.prompt_tokens,
    "completion_tokens": response.usage.completion_tokens,
    "total_tokens": response.usage.total_tokens,
}
return answer, token_stats
```

## 8. Comprehensive Answer Evaluation

### LLM-as-a-Judge Methodology

We implement automated quality assessment using a specialized evaluation prompt:

```python
# src/core/rag.py
EVALUATION_PROMPT_TEMPLATE = """You are an expert medical reviewer evaluating the quality and relevance of AI-generated medical responses.

You will be given a medical question and a generated answer. Based on the relevance of the generated answer, you will classify it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Question: {question}
Generated Answer: {answer}

Provide evaluation in JSON format:
{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Brief explanation for evaluation]"
}}"""
```

### Quality Assessment Metrics

  1. Relevance Categories:

     • RELEVANT: Direct, accurate medical information
     • PARTLY_RELEVANT: Partially helpful but incomplete
     • NON_RELEVANT: Off-topic or potentially harmful

  2. Evaluation Processing:

```python
# src/core/rag.py
def evaluate_relevance(question: str, answer: str) -> Tuple[Dict, Dict]:
    prompt = EVALUATION_PROMPT_TEMPLATE.format(question=question, answer=answer)
    evaluation, tokens = llm(prompt, model="gpt-4o-mini")
    try:
        json_eval = json.loads(evaluation)
        return json_eval, tokens
    except json.JSONDecodeError:
        return {"Relevance": "UNKNOWN", "Explanation": "Parse failed"}, tokens
```

### Medical Safety Considerations

  • Conservative Evaluation: Strict relevance criteria
  • Explanation Tracking: Maintains audit trail for quality decisions
  • Error Handling: Graceful degradation for parsing failures
  • Dual Model Use: Separate evaluation model reduces bias

## 9. Cost Optimization & Monitoring

### Transparent Cost Calculation

```python
# src/core/rag.py
def calculate_openai_cost(model: str, tokens: Dict) -> float:
    """Calculate OpenAI API cost with model-specific pricing"""
    cost = 0.0
    if model == "gpt-4o-mini":
        cost = (
            tokens.get("prompt_tokens", 0) * 0.00015 +      # Input cost
            tokens.get("completion_tokens", 0) * 0.0006     # Output cost
        ) / 1000
    elif model == "gpt-4o":
        cost = (
            tokens.get("prompt_tokens", 0) * 0.03 +         # Higher input cost
            tokens.get("completion_tokens", 0) * 0.06       # Higher output cost
        ) / 1000
    return cost
```
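
A quick worked example with hypothetical token counts shows the scale of a typical per-query cost:

```python
tokens = {"prompt_tokens": 1200, "completion_tokens": 300}
cost = calculate_openai_cost("gpt-4o-mini", tokens)
# (1200 * 0.00015 + 300 * 0.0006) / 1000 = 0.00036
print(f"${cost:.5f}")  # $0.00036
```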

### Cost Performance Metrics

  • Average Cost per Query: $0.003
  • Average Response Time: ~23 seconds per query
  • Cost Breakdown: RAG generation and evaluation costs are tracked separately

## 10. Complete Pipeline Integration

### Master RAG Function

```python
# src/core/rag.py
def rag(query: str, model: str = "gpt-4o-mini") -> Dict:
    """Complete RAG pipeline with comprehensive response data"""
    t0 = time()  # Start timing

    # Step 1: Hybrid search
    search_results = search(query)

    # Step 2: Prompt assembly
    prompt = build_prompt(query, search_results)

    # Step 3: LLM generation
    answer, token_stats = llm(prompt, model=model)

    # Step 4: Quality evaluation
    relevance, rel_token_stats = evaluate_relevance(query, answer)

    # Step 5: Metrics calculation
    took = time() - t0
    total_cost = calculate_openai_cost(model, token_stats) + \
                 calculate_openai_cost("gpt-4o-mini", rel_token_stats)

    # Step 6: Response packaging
    return {
        "answer": answer,
        "model_used": model,
        "response_time": took,
        "relevance": relevance.get("Relevance", "UNKNOWN"),
        "relevance_explanation": relevance.get("Explanation", "None"),
        "total_cost": total_cost,
        "token_stats": {...},  # Comprehensive token tracking
        "search_results_count": len(search_results),
        "search_results": search_results[:5],  # Top results for audit
    }
```
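
Using the pipeline is then a single call (the question below is illustrative):

```python
result = rag("What is the first-line treatment for type 2 diabetes?")
print(result["answer"])
print(result["relevance"], f"${result['total_cost']:.4f}", f"{result['response_time']:.1f}s")
```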

## 11. Architecture Design Principles

### Medical Safety First

  • Evidence-Based: All answers must cite retrieved medical literature
  • Conservative Generation: Low temperature reduces the risk of hallucination
  • Quality Gates: Multi-step evaluation ensures reliability
  • Audit Trails: Complete logging for medical compliance

### Performance Optimization

  • Hybrid Retrieval: Combines semantic understanding with exact matching
  • Model Selection: Cost-effective model choice based on performance data
  • Caching Ready: Architecture supports future caching implementations
  • Scalable Design: Async processing and background tasks (see the sketch below)
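
The component table lists src/api/main_api.py as the async REST layer; here is a minimal sketch of what such an endpoint could look like (the route name and request schema are assumptions):

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str
    model: str = "gpt-4o-mini"

@app.post("/ask")
async def ask(body: AskRequest) -> dict:
    # rag() is synchronous, so run it in a worker thread to keep the event loop free
    return await asyncio.to_thread(rag, body.question, body.model)
```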

### Production Considerations

  • Error Handling: Graceful degradation at each step
  • Monitoring Integration: Built-in metrics and logging hooks
  • Cost Control: Transparent pricing with usage tracking
  • Extensibility: Modular design supports feature additions

## 12. Future Enhancement Opportunities

### Immediate Improvements

  • Citation Generation: Add source document references in answers
  • Query Preprocessing: Medical entity recognition and normalization
  • Response Caching: Cache frequent queries to reduce costs (see the sketch after this list)
  • Batch Processing: Optimize multiple simultaneous queries
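
As a sketch of the caching idea above, an in-memory memo keyed on a normalized query hash would already avoid repeat LLM charges; a production version would more likely use Redis with a TTL:

```python
import hashlib

_cache: dict = {}

def cached_rag(query: str, model: str = "gpt-4o-mini") -> dict:
    """Serve repeated questions from memory so only the first occurrence is billed."""
    key = hashlib.sha256(f"{model}:{query.strip().lower()}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = rag(query, model)
    return _cache[key]
```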

### Advanced Features

  • Multi-Modal Support: Integrate medical images and charts
  • Specialized Models: Fine-tuned models for specific medical domains
  • Real-Time Learning: Incorporate user feedback into model updates
  • Clinical Integration: EMR system compatibility and FHIR support

### Research Directions

  • Retrieval Enhancement: Advanced embedding models for medical text
  • Generation Improvement: Medical-specific language model fine-tuning
  • Evaluation Evolution: Automated medical accuracy assessment
  • Safety Advancement: Enhanced harm detection and prevention
