Vivek Patel

Under the Hood: VaidhLlama Architecture & Training Pipeline

Standard LLMs struggle with the Sanskrit-heavy logic of Ayurveda. They often reduce 'doshas' to simple biological humors, missing their deeper role as systemic bio-energetic forces. We built VaidhLlama to fix this.

*This technical deep dive explores how we achieved **41.91% accuracy** on the BhashaBench-Ayur benchmark using a 3B-parameter model, outperforming its Llama-3.2-3B base and the comparable Gemma-2-2B baseline by focusing on "Data Density" over raw compute.*


1. The Results: Punching Above Its Weight Class

Before discussing how we built it, let's look at what it achieved. Evaluation uses BhashaBench-Ayur, a rigorous benchmark aimed at preserving Indian Knowledge Systems (IKS), containing expert-level questions derived from BAMS/MD curricula.

Benchmark Performance

While larger models like Gemma-2-27B still hold an advantage due to sheer scale, VaidhLlama successfully outperforms its direct base (Llama-3.2-3B) and remains competitive with other state-of-the-art small language models (SLMs).

Results were benchmarked using the EM metric. Full code and data are available at: https://github.com/viveks-codes/BhashaBench-Ayur-Results
Reported Accuracy Comparison (EM)

| Model | Parameters | Run 1 (%) | Run 2 (%) | Mean (%) | Note |
|---|---|---|---|---|---|
| VaidhLlama | 3B | 41.91 | 41.91 | 41.91 | Fine-tuned (Ours) |
| Llama-3.2-Instruct | 3B | 40.74 | 40.74 | 40.74 | Base Model |
| Gemma-2-Instruct | 2B | 41.00 | 41.00 | 41.00 | Comparable Size |
| Qwen2.5 | 3B | 46.76 | 46.76 | 46.76 | Strong Baseline |
| Gemma-2-Instruct | 27B | 52.17 | 52.17 | 52.17 | Large Model Reference |

Note on Data Coverage & Future Potential: The quoted accuracy is consistent across multiple runs, confirming deterministic evaluation. Notably, the 41.91% figure was achieved using only a partial subset (~10%) of our total curated logical frameworks, which suggests the ceiling for this architecture is significantly higher once the full-scale dataset is processed and ingested.
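For clarity, "EM" here means exact-match scoring: a prediction counts only if it matches the gold answer exactly (e.g., the correct option letter). The snippet below is a minimal sketch of that computation under our assumptions; the "pred"/"gold" field names are illustrative and this is not the official BhashaBench-Ayur harness.

# Minimal exact-match (EM) accuracy sketch; field names are illustrative assumptions.
def exact_match_accuracy(records):
    hits = sum(
        1 for r in records
        if r["pred"].strip().upper() == r["gold"].strip().upper()
    )
    return 100.0 * hits / len(records)

# Example: two of three answers match exactly -> 66.67
print(round(exact_match_accuracy([
    {"pred": "B", "gold": "B"},
    {"pred": "c", "gold": "C"},
    {"pred": "A", "gold": "D"},
]), 2))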

The Specialist's Trade-off (Gain vs. Regression)

Intellectual honesty is key to scientific progress. Our analysis reveals that specialization comes at a cost, often described as the "Specialist's Curse." We observed a clear "Domain Shift" where the model sacrificed general breadth for clinical depth.

(Figure: heatmap of per-domain gains and regressions after fine-tuning)

The data tells a compelling story of re-alignment:

  1. Massive Clinical Gains: The model demonstrated a +100% improvement in purely clinical domains like Ayurvedic Diagnosis (Nidana) and Vajikarana (Sexology), alongside a strong +26.7% boost in Research Methodology.
  2. The "Terminology Shift" (The Toxicology Paradox): Perhaps the most critical finding is in Toxicology. While the model's accuracy on generic "Toxicology" dropped by 33%, its performance on specialized "Ayurvedic Toxicology (Agada Tantra)" actually IMPROVED by 25%.

This confirms our hypothesis: VaidhLlama isn't just "forgetting"; it is specializing. It now prefers the specific logic of Agada Tantra over generic textbook definitions. We accept this trade-off: we are building a Vaidh (Specialist Doctor), not a generalist librarian.


2. Under the Hood: Architecture

VaidhLlama inherits the core transformer architecture from Llama-3.2-3B-Instruct, with specialized adaptations for traditional medicine applications. The model employs a dense decoder-only configuration optimized for edge-compatible inference.

Technical Specifications:

  • Base Model: Llama-3.2-3B (3.21B Parameters)
  • Architecture: Optimized Transformer (Decoder-only)
  • Context length: 128k supported (optimized for 4k clinical context)
  • Precision: bfloat16 for training, compatible with Int4 quantization
  • Vocabulary: Standard tokenizer aligned for Sanskrit/Ayurvedic terms
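As a quick illustration of these settings, here is a minimal loading sketch with Hugging Face Transformers. The repository ID is a placeholder (the model is not published under that name), and the commented 4-bit path via bitsandbytes is one possible Int4 option rather than a shipped configuration.

# Hypothetical checkpoint ID shown for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/VaidhLlama-3B"  # placeholder, not a published repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the bfloat16 training precision
    device_map="auto",
)

# One possible Int4 path for edge-style inference (requires bitsandbytes):
# from transformers import BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
#     device_map="auto",
# )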

3. Data Preparation: High-Density Curation

Standard datasets rely on volume. VaidhLlama’s dataset relies on density. We processed raw Ayurvedic texts through a rigorous "Inverse Law" curation pipeline, reducing generic noise to amplify clinical signal.

Ethical Data Sourcing

Our data pipeline is built on a foundation of legally compliant sources. We curated a proprietary corpus from explicit-permission websites and digitized manuscripts obtained through formal partnerships with Ayurvedic universities. This ensures that VaidhLlama is trained on legitimate, high-quality academic knowledge rather than unverified web scrapes.

Pipeline Architecture:

We employed a multi-stage filtering process using NVIDIA NeMo Curator followed by synthetic scale-up via vLLM.

Logical Density Curation (NVIDIA NeMo)

Most medical datasets prioritize keyword frequency. We prioritized reasoning density. Using NVIDIA NeMo Curator, we built a custom AyurvedaQualityFilter that defined a rigorous taxonomy of 13+ clinical categories, ensuring that only texts containing deep reasoning (Nyayas) and expert physiology (Sharir Kriya) passed the gate.

# Assumes NVIDIA NeMo Curator's DocumentFilter base class.
from nemo_curator.filters import DocumentFilter

class AyurvedaQualityFilter(DocumentFilter):
    """
    ULTIMATE FILTER:
    Includes: Core Topics + Ashtanga + Expert Physiology (Sharir Kriya)
    """
    def __init__(self, min_score=3):
        super().__init__()
        self.min_score = min_score

        # Abbreviated keyword taxonomy (the full filter covers 13+ clinical categories).
        self.keywords = set([
            # --- 1. THE TRIDOSHAS (Bio-energies) ---
            "vata", "pitta", "kapha", "tridosha", "prakriti",

            # --- 13. EXPERT PHYSIOLOGY & SHARIR KRIYA (The Reasoning Core) ---
            # A. Philosophy & Logic
            "purvapaksha", "uttarapaksha", "sharir", "kriya", "padartha",

            # B. The NYAYAS (Laws of Nourishment - CRITICAL)
            "nyaya", "kshir-dadhi", "kedari-kulya", "khale-kapota",
        ])

    def score_document(self, text: str) -> float:
        # Simplified stand-in for the production scoring logic, which prioritizes
        # co-occurrence of these terms: here we count distinct keywords present.
        lowered = text.lower()
        score = sum(1.0 for kw in self.keywords if kw in lowered)
        return score

    def keep_document(self, score: float) -> bool:
        # A document passes the gate only if its reasoning density meets the threshold.
        return score >= self.min_score

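For context, here is roughly how such a filter plugs into a curation run. The paths and field names are placeholders, and the ScoreFilter/DocumentDataset interfaces shown should be checked against the installed NeMo Curator version; treat this as a sketch, not our production pipeline.

# Illustrative only: paths and field names are placeholders.
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset

raw_docs = DocumentDataset.read_json("curation/raw_ayurveda/*.jsonl")  # hypothetical path

quality_gate = ScoreFilter(
    AyurvedaQualityFilter(min_score=3),
    text_field="text",          # column holding the raw document text
    score_field="ayur_score",   # keep the density score for later analysis
)

dense_docs = quality_gate(raw_docs)
dense_docs.to_json("curation/dense_ayurveda/")  # hypothetical output directory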

Synthetic Distillation (vLLM)

  • Teacher: Llama-3.3-70B-Instruct
  • Architecture: Threaded Producer-Consumer loop using vLLM on 8x GPUs.
  • Output: 130,954 high-quality, complex Q&A pairs (verified count).
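The production harness is not shown in this post; below is a minimal sketch of a threaded producer-consumer loop in the spirit described above, using vLLM's offline LLM/SamplingParams API. The Hugging Face checkpoint ID, queue size, batch size, and prompt template are assumptions for illustration.

# Minimal sketch (not the production harness): a producer thread feeds seed
# passages into a queue while the main thread batches them through vLLM.
import queue
import threading

from vllm import LLM, SamplingParams

# Assumed teacher checkpoint, sharded across 8 GPUs via tensor parallelism.
teacher = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=8)
sampling = SamplingParams(temperature=0.7, max_tokens=512)

work_q: "queue.Queue" = queue.Queue(maxsize=1024)
SENTINEL = None

def producer(passages):
    # Wrap curated Ayurvedic passages in a Q&A-generation prompt (template is hypothetical).
    for passage in passages:
        work_q.put(f"Write an expert-level Ayurveda Q&A pair based on:\n{passage}")
    work_q.put(SENTINEL)

def consume(batch_size=32):
    results = []
    done = False
    while not done:
        batch = []
        while len(batch) < batch_size:
            item = work_q.get()
            if item is SENTINEL:
                done = True
                break
            batch.append(item)
        if batch:
            outputs = teacher.generate(batch, sampling)
            results.extend(o.outputs[0].text for o in outputs)
    return results

passages = ["..."]  # curated passages from the NeMo Curator stage
threading.Thread(target=producer, args=(passages,), daemon=True).start()
qa_pairs = consume()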

4. Training Methodology: The "Unmasked" Strategy

VaidhLlama employs a specific variation of Supervised Fine-Tuning (SFT) often referred to as Continued Pre-training on the instruction set.

Unmasked Instruction Tuning (Full-Sequence Loss)

Standard instruction tuning masks the user prompt, calculating loss only on the model's response. For a niche domain like Ayurveda, this is suboptimal. The model must learn the complex syntax of the question itself (often containing Sanskrit slokas).

We disabled prompt masking, forcing the model to learn the joint probability of the entire sequence. This effectively acts as domain-adaptive pre-training mixed with instruction following.

# scripts/finetune.py
def formatting_func(example):
    # ... (Prompt construction) ...
    # `messages` is the chat-format list built here from the example's question and
    # answer fields; `tokenizer` is the Llama-3.2 tokenizer defined elsewhere in the script.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokenized = tokenizer(text, truncation=True, max_length=1024, padding="max_length")

    # UNMASKED TRAINING / CONTINUED PRE-TRAINING:
    # Unlike standard chat-tuning, we do not mask the prompt. 
    # We calculate loss on the full sequence so the model learns 
    # the syntax of the Ayurvedic questions (Sanskrit slokas) alongside the answers.
    labels = tokenized["input_ids"].copy()
    labels = [-100 if t == tokenizer.pad_token_id else t for t in labels]

    return {"input_ids": tokenized["input_ids"], "labels": labels}


Our WandB logs indicate that the model converged rapidly, hitting a performance plateau at approximately step 2,356. This rapid convergence on a limited data subset confirms that our unmasked instruction tuning is highly effective. In future runs, we may optimize for this by implementing early stopping around the 2,400-step mark, allowing us to redirect compute toward processing the remaining 90% of our dataset.
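If we do cap future runs near that plateau, a lightweight option is a hard step cap plus early stopping on evaluation loss. The sketch below assumes a Hugging Face Trainer-style loop (the post only shows our formatting function), so treat the argument names and values as illustrative rather than our production config.

# Sketch only: hard step cap plus early stopping, assuming a Hugging Face Trainer loop.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="vaidhllama-sft",
    max_steps=2400,                 # hard cap near the observed plateau (~step 2,356)
    eval_strategy="steps",          # "evaluation_strategy" in older transformers releases
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
)

# Passed to the existing Trainer alongside the model and datasets from finetune.py:
early_stop = EarlyStoppingCallback(early_stopping_patience=3)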


5. Future Roadmap: Scaling

The success of VaidhLlama-3B at IIM Indore is just the beginning: a proof of concept for what is possible with disciplined data engineering. However, to truly rival 70B+ models, we must scale our infrastructure.

The Next Phase: Integration & Scale

To move from "Prototype" to "Production," the roadmap requires closer integration with larger compute clusters and the broader BharatGen engineering ecosystem.

  1. Deeper Integration with BharatGen Core: Transitioning our pipeline from isolated experimental setups to the central BharatGen infrastructure will allow us to train/finetune bigger models.
  2. Cross-Institutional Synergy: The OCR pipeline at IIM Indore has created the data; the next step is engineering the large-scale pre-training, a task best suited for the robust compute environments available at our partner nodes (e.g., IIT Bombay).
  3. Our Vision: Effectively bridging the unique data insights from subject matter experts with the high-performance engineering culture of our core technical teams is critical. We are fully prepared to facilitate this bridge, bringing the domain expertise developed here to the central engineering hub for the next phase of deployment.

Resources

Special thanks to my intern team, Riddhima, Viren, Pranav, Hiyaa, Adarsh, Dev, and Niyati, for their diligent work on the data scraping pipeline that fueled this project ❤️
