Nicholas (Kosisochukwu) Ugbala

Posted on May 27

Fine-Tuning Llama 3.2 3B on Medical QA: Week 2 - Data Preparation

#ai #machinelearning #llm #programming

What Happened This Week

Week 1 established the baseline. This week is where the actual engineering begins.

Before any fine-tuning can happen, the training data has to be in the exact format the model expects. That sounds simple. It is not. This week involved loading a 112K-row medical dataset, discovering it was the wrong dataset for the goal, switching to a different dataset, building a cleaning pipeline, and formatting everything into the Llama 3.2 chat template. Every step had a decision worth explaining.

The Wrong Dataset

The initial plan was to use lavita/medical-qa-datasets with the medical_meadow_medqa subset. Loading it and inspecting the samples revealed a problem I initially ignored.
The outputs looked like this:

OUTPUT: D: Trimethoprim-sulfamethoxazole
OUTPUT: A: The most important risk factors are hypertension and diabetes
OUTPUT: E: Pneumovax

These are answer selections, not clinical explanations. The dataset is USMLE multiple-choice questions. Training on this would produce a model that selects answer letters from five options, which is not the goal. The goal is a model that answers clinical questions in clear, factual prose.

The dataset was correct in provenance (NIH-sourced, board-exam quality) but incorrect in shape. Switching the dataset was the right call.
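A shape check like this can be automated before committing to a dataset. The sketch below is a hypothetical heuristic, not part of the original pipeline: it assumes that an output beginning with a single option letter (A to E) and a colon is an answer selection, and reports what share of a sample looks that way. The names `looks_like_mcq_answer` and `mcq_share` are made up for illustration.

```python
import re

# Hypothetical heuristic: an output that starts with one option letter (A-E)
# followed by a colon looks like a multiple-choice selection, not prose.
MCQ_ANSWER = re.compile(r'^\s*[A-E]\s*:\s+\S')

def looks_like_mcq_answer(output: str) -> bool:
    """Return True when the output begins with an option letter such as 'D: ...'."""
    return bool(MCQ_ANSWER.match(output))

def mcq_share(outputs: list[str]) -> float:
    """Fraction of outputs that look like answer selections (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(looks_like_mcq_answer(o) for o in outputs) / len(outputs)

# The three outputs quoted above from medical_meadow_medqa.
samples = [
    "D: Trimethoprim-sulfamethoxazole",
    "A: The most important risk factors are hypertension and diabetes",
    "E: Pneumovax",
]
print(mcq_share(samples))
```

A share close to 1.0 on a random sample is a strong hint that the dataset teaches answer selection rather than explanation.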

The Right Dataset

ChatDoctor HealthCareMagic 100K (lavita/ChatDoctor-HealthCareMagic-100K) is 112,165 real patient questions with doctor responses in prose format. Output looks like this:

"Fibrotic scarring in the right apical region of the lung may be due to past infection like tuberculosis. Fibrosis is a healed stage and generally does not require treatment. You may need to follow up with a chest physician for monitoring."

That is the output style the fine-tuned model should produce. Conversational, factual, direct.

The tradeoff: this is real forum data, not curated clinical text. Quality varies. Some responses are excellent clinical reasoning. Others are vague. The engineering problem for Week 2 was building a cleaning pipeline that keeps the signal and removes the noise.

The Cleaning Pipeline

Loading the raw dataset and inspecting samples revealed four specific problems:

Platform filler in outputs. Every response opens with noise that the model will learn and replicate:

"Hello, welcome to Chat Doctor..."
"Thanks for using Chat Doctor..."
"Hi Dear, Welcome to Chat Doctor..."
"and I hope I can help you today..."
"Thank you for posting your query..."

If these survive into training data, the fine-tuned model will learn to open every response the same way. That will be way worse than the base model's filler.

Trailing sign-offs. Outputs ended with:

"...Best wishes, Chat Doctor."
"...I hope this helps."
"...Take care."

Same problem as the filler openers. These are social conventions from a forum platform, not clinical reasoning patterns worth learning.

Platform name artifacts in inputs. Some patient inputs contained the platform name mid-sentence, likely leaked in through copy-paste errors during data collection. Training on these teaches the model that "ChatDoctor" is a meaningful clinical term.

Output quality variance. Some outputs, at fewer than 30 words, were too short to contain useful clinical content. Some sequences were too long for the T4's VRAM budget when tokenized.

The cleaning function strips filler from both ends of every output using regex patterns, and the filter function rejects rows that fail the quality thresholds. A second pass removes any samples where platform artifacts survived cleaning.

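import re

def clean_output(text):
    # Greeting and platform filler at the start of a response.
    # Patterns run in order, so the full "Hello ... Welcome to Chat Doctor"
    # opener is tried before the bare "Hello" and "Hi" patterns.
    filler_starts = [
        r'^Hello[\w\s,]*Welcome to Chat\s?Doctor[.\s]*',
        r'^and I hope I can help you today\.?[\s]*',
        r'^Thank you for (posting|consulting|writing|using)[\w\s,]*[.\s]+',
        r'^Hello[\s,]+',
        r'^Hi[\s,]+',
        r'^Dear[\w\s,]+,[\s]*',
    ]
    for pattern in filler_starts:
        # Strip after each substitution so the next '^' pattern sees real text.
        text = re.sub(pattern, '', text, flags=re.IGNORECASE).strip()

    # Sign-offs at the end of a response.
    filler_ends = [
        r'[,.]?\s*Best wishes,?\s*Chat\s?Doctor\.?$',
        r'[,.]?\s*I hope (this|it) helps?\.?$',
        r'[,.]?\s*Take care\.?$',
    ]
    for pattern in filler_ends:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE).strip()

    return text

def is_clean(sample):
    # Reject inputs that still mention the platform name.
    if re.search(r'chatdoctor', sample['input'], re.IGNORECASE):
        return False
    # Reject outputs too short to carry useful clinical content.
    if len(sample['output'].split()) < 30:
        return False
    # Reject pairs that are too long; word count is a rough proxy for tokens.
    if len(sample['input'].split()) + len(sample['output'].split()) > 600:
        return False
    return True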
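The second pass mentioned above is not shown in the post. A minimal sketch of what it could look like follows, under the assumption that any remaining mention of the platform name, with or without a space, in either field counts as an artifact. `has_platform_artifact` and `second_pass` are hypothetical names, and the rows are invented for the example.

```python
import re

# Assumption: "Chat Doctor" and "ChatDoctor" both count as platform artifacts.
# The exact rule used in the project is not shown in the post.
PLATFORM = re.compile(r'chat\s?doctor', re.IGNORECASE)

def has_platform_artifact(sample: dict) -> bool:
    """True when the platform name appears in either the input or the output."""
    return any(PLATFORM.search(sample[field]) for field in ("input", "output"))

def second_pass(samples: list[dict]) -> list[dict]:
    """Keep only samples with no remaining platform mentions."""
    return [s for s in samples if not has_platform_artifact(s)]

# Invented rows for illustration only.
rows = [
    {"input": "I have a cough.", "output": "A dry cough is often viral."},
    {"input": "I asked Chat Doctor before.", "output": "Please see a physician."},
    {"input": "Knee pain.", "output": "Rest the joint. Regards, ChatDoctor team."},
]
kept = second_pass(rows)
print(len(kept))
```

Checking both fields after cleaning catches mentions that sit mid-sentence, where the start and end patterns cannot reach them.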

Result: 112,165 rows cleaned down to 45,205. About 60% of the data was removed. That can look like throwing away data that would have helped the model. The point is that a model trained on 45K clean samples should outperform one trained on 112K noisy ones.

10,000 rows were then sampled randomly with seed=42 for reproducibility.
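The sampling call itself is not shown. With the Hugging Face datasets library, `shuffle(seed=42)` followed by `select(range(10_000))` is a common way to do it, though that is an assumption about this project. The sketch below shows the same idea in plain Python so it runs without any downloads; `sample_rows` and the stand-in rows are hypothetical.

```python
import random

def sample_rows(rows: list, k: int, seed: int = 42) -> list:
    """Draw k rows without replacement using a dedicated, seeded generator."""
    if k > len(rows):
        raise ValueError("cannot sample more rows than are available")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(rows, k)

# Stand-in for the 45,205 cleaned rows; real rows would hold input/output text.
cleaned = [{"id": i} for i in range(45_205)]

subset_a = sample_rows(cleaned, 10_000)
subset_b = sample_rows(cleaned, 10_000)
print(len(subset_a), subset_a == subset_b)
```

Fixing the seed means anyone who starts from the same cleaned rows in the same order gets the same 10,000 samples.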

Formatting Into the Llama Chat Template

The model was trained on a specific conversation format. Feeding it data in any other structure will produce a corrupted training signal because the model does not know which tokens are the user's questions and which are the assistant's answers.

Every sample was converted from this:

{
    "instruction": "If you are a doctor, please answer the medical questions...",
    "input": "I have been having sharp chest pain on my left side...",
    "output": "Sharp chest pain that worsens with deep breathing..."
}

Into this:

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
If you are a doctor, please answer the medical questions based on the patient's description.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
I have been having sharp chest pain on my left side for two days...
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Sharp chest pain that worsens with deep breathing is often pleuritic in nature...
<|eot_id|>

The instruction field is identical across all 112K rows. It is a static task description, not a per-sample instruction, so it belongs in the system prompt, which is the model's standing brief about its role. The patient's question goes in the user turn. The doctor's response goes in the assistant turn.

add_generation_prompt is set to False during training because the full assistant response is already in the data. The model is learning to produce that response, not being prompted to generate a new one.
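The field-to-turn mapping can be sketched by hand. This is an illustration only: in practice the string would normally come from the tokenizer, for example `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)`, and the tokenizer's own template may differ in whitespace or default system text. `render_turn` and `format_sample` are hypothetical helpers.

```python
def render_turn(role: str, content: str) -> str:
    """Render one turn: role header, blank line, content, end-of-turn token."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content.strip()}<|eot_id|>"

def format_sample(sample: dict) -> str:
    """Map instruction/input/output onto system/user/assistant turns."""
    return (
        "<|begin_of_text|>"
        + render_turn("system", sample["instruction"])
        + render_turn("user", sample["input"])
        + render_turn("assistant", sample["output"])
    )

sample = {
    "instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
    "input": "I have been having sharp chest pain on my left side for two days...",
    "output": "Sharp chest pain that worsens with deep breathing is often pleuritic in nature...",
}
text = format_sample(sample)
print(text)
```

The string ends with the assistant's `<|eot_id|>` rather than an open assistant header. That is the hand-built equivalent of add_generation_prompt=False: the answer is part of the training text, so there is nothing left for the model to complete.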

Token Length Distribution

Before finalising the dataset, the token length of every formatted sample was measured:

Shortest:    78 tokens
Longest:     794 tokens
Average:     261 tokens
Over 512:    110 samples (1.1%)
Over 1024:   0 samples

max_seq_length = 512 was chosen for training. Only 1.1% of samples exceed it, so truncation loss is negligible. Using 512 instead of 1024 means less VRAM per sequence, faster training, and larger effective batch sizes on the T4.
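A report like the one above can be produced from a list of per-sample token counts. The sketch below assumes those counts already exist, for instance from `len(tokenizer(text)["input_ids"])` on each formatted sample. The `length_report` helper and the example lengths are made up for illustration and are not the project's data.

```python
from statistics import mean

def length_report(token_lengths: list[int], limits: tuple[int, ...] = (512, 1024)) -> dict:
    """Summarise token lengths and count how many samples exceed each limit."""
    if not token_lengths:
        raise ValueError("token_lengths is empty")
    report = {
        "shortest": min(token_lengths),
        "longest": max(token_lengths),
        "average": round(mean(token_lengths)),
    }
    for limit in limits:
        over = sum(1 for n in token_lengths if n > limit)
        report[f"over_{limit}"] = over
        report[f"over_{limit}_pct"] = round(100 * over / len(token_lengths), 1)
    return report

# Made-up lengths; real values would come from tokenizing each sample.
lengths = [78, 190, 240, 261, 300, 530, 794]
report = length_report(lengths)
print(report)
```

Applied to the figures reported above, 110 samples out of 10,000 is the 1.1% that a 512-token limit would truncate.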

Train and Eval Split

The 10K formatted samples were split 90/10:

split = formatted.train_test_split(test_size=0.1, seed=42)
# Train: 9,000 samples
# Eval:  1,000 samples

I considered an 80/10/10 three-way split but decided against it. In this fine-tuning setup, a separate test set adds little value, because no architectural decisions are being made based on held-out results. The eval set tracks loss on held-out data during training. The real qualitative test is the five baseline questions from Week 1, run through the fine-tuned model after training.
The cleaned dataset is published on the Hugging Face Hub for full reproducibility.

Why ChatDoctor and Not Something Better

ChatDoctor is not the highest-quality medical dataset available. PubMedQA has better clinical provenance. Augmented MedQA with chain-of-thought reasoning would produce stronger results. A GPT-4 synthesised dataset from medical textbooks would be cleaner.

ChatDoctor was chosen for three specific reasons. First, the output format matches the goal: conversational prose responses to patient-described symptoms. PubMedQA produces yes/no research answers, not clinical explanations. MedQA is multiple choice. Neither matches the target output style. Second, it is publicly available, ungated, and immediately loadable without preprocessing overhead. Augmented chain-of-thought versions of MedQA do not exist as clean public datasets and would require GPT-4 generation to create, introducing a proprietary dependency. Third, the cleaning problem is real and representative: building a pipeline that filters 112K noisy forum rows to 45K usable samples is closer to production data engineering than loading a pre-sanitised benchmark. For a project demonstrating the full fine-tuning pipeline, that tradeoff is deliberate.

What's Next

Week 2 is done. The dataset is clean, formatted, split, and live on Hugging Face.

Week 3 is the first LoRA fine-tuning run: configuring PEFT, setting up SFTTrainer, running training on the T4, and comparing the fine-tuned model's outputs against the Week 1 baseline. That is where the project works or reveals what needs fixing.
Cleaned dataset
healthcare-llm-finetune Repo