DEV Community: Nicholas (Kosisochukwu) Ugbala

Fine-Tuning Llama 3.2 3B on Medical QA: Week 4 - When Lower Loss Meant a Worse Model

Nicholas (Kosisochukwu) Ugbala — Tue, 16 Jun 2026 11:33:04 +0000

What Happened This Week

Week 3 produced a working fine-tuned model: one epoch, one dataset, a clear improvement over the base model. Week 4 was supposed to make it better with more data (a second dataset), two epochs, and a cleaner setup.

The eval loss dropped from 2.495 to 2.275. By that number alone, Week 4 was a success.

The model was worse.

This is the story of how a better loss number hid a serious regression, how I diagnosed it, and what it took to actually fix it. It is one of the most useful things I have learned in this project.

The Plan

Four changes over Week 3:

Combine two datasets: ChatDoctor (conversational patient-doctor QA) and MedAlpaca WikiDoc (encyclopedic clinical reference), for both conversational style and factual grounding.
Use Llama's built-in pad token instead of adding a custom one to avoid an oversized adapter file.
Train for two epochs on the full dataset instead of one.
Switch evaluation to greedy decoding for reproducibility.

The Pad Token Fix

In Week 3, I added a custom pad token and resized the model's embedding layer. This had an unintended cost: PEFT saved the entire resized embedding layer alongside the LoRA adapters, producing a 3.19GB adapter file instead of the expected ~50MB.
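A back-of-the-envelope check suggests where the 3.19GB could come from. This is a sketch under stated assumptions, not a reading of the actual file: it assumes the Llama 3 vocabulary of 128,256 tokens plus one added pad token, a hidden size of 3,072, float32 storage, and that both the resized input embedding and the output head were written next to the adapters. The adapter size uses the roughly 9 million trainable parameters from the Week 3 run.

```python
# Rough estimate of the Week 3 adapter file size.
# Assumptions (not confirmed from the file itself): float32 storage, and both
# the resized input embedding and the output head saved with the adapters.
VOCAB_SIZE = 128_256 + 1      # Llama 3 vocabulary plus one custom pad token
HIDDEN_SIZE = 3_072           # Llama 3.2 3B hidden dimension
BYTES_PER_PARAM = 4           # float32
LORA_PARAMS = 9_000_000       # approximate adapter parameter count from Week 3
SAVED_MATRICES = 2            # input embedding + output head (assumed)

embedding_params = VOCAB_SIZE * HIDDEN_SIZE
total_bytes = (embedding_params * SAVED_MATRICES + LORA_PARAMS) * BYTES_PER_PARAM
adapter_only_bytes = LORA_PARAMS * BYTES_PER_PARAM

total_gb = total_bytes / 1e9
adapter_only_mb = adapter_only_bytes / 1e6
print(f"with embeddings: {total_gb:.2f} GB")
print(f"adapters only:   {adapter_only_mb:.0f} MB")
```

Under those assumptions the estimate lands close to the observed file size, and the adapters themselves account for only tens of megabytes of it.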

Llama 3.2's tokenizer already ships with a reserved padding token, <|finetune_right_pad_id|> (token 128004), made for exactly this purpose. Using it instead of adding a new token:

...
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.padding_side = "right"
# No add_special_tokens, no resize_token_embeddings

No embedding resize means no embedding layer saved with the adapters. The Week 4 adapter came out at ~50MB.

What I learned: before adding a special token, check whether the model already has one. Llama 3.2 did.

Building the Combined Dataset

ChatDoctor alone produced a model that answered in a conversational manner but sometimes lacked factual grounding. WikiDoc is reference-grade encyclopedic medical content. The idea was that combining them would give both conversational style and factual grounding.

The first combine used 8,000 ChatDoctor and 4,000 WikiDoc, a 2:1 ratio. After cleaning and a 512-token length filter, this produced 10,255 rows: 9,229 train, 1,026 eval.

The cleaning itself was an exercise in diminishing returns. ChatDoctor forum data carries platform filler ("Hello, welcome to Chat Doctor", "Hope this helps"), and the source has OCR-level corruption that breaks pattern matching ("HiT hanks" for "Hi. Thanks"). I built a two-pass regex cleaner plus a sentence-level trailing filler stripper that removes any closing sentence containing filler keywords. It caught most of the noise. A small fraction of the corruption resisted cleaning entirely, which I documented rather than chased.
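The sentence-level stripper can be sketched roughly like this. It is a minimal illustration, not the project's actual cleaner: the keyword list and the function name `strip_trailing_filler` are hypothetical, and the sentence splitting is deliberately naive.

```python
import re

# Hypothetical keyword list; the real one is not published in this post.
FILLER_KEYWORDS = ("hope this helps", "best wishes", "take care", "let me know if")

def strip_trailing_filler(text: str) -> str:
    """Drop closing sentences that contain a filler keyword."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Walk backwards from the end, removing sentences until one is clean.
    while sentences and any(k in sentences[-1].lower() for k in FILLER_KEYWORDS):
        sentences.pop()
    return " ".join(sentences)

cleaned = strip_trailing_filler(
    "Iron deficiency is usually treated with oral iron. "
    "Hope this helps. Let me know if I can assist you further."
)
print(cleaned)
```

Working from the end of the text means filler in the middle of an answer is left alone, which keeps the stripper from deleting clinical content that happens to share a keyword.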

The Training Run That Looked Like Success

Two epochs, 1,154 steps, about four hours on a Kaggle T4.

Step 150:   train 2.499  |  eval 2.474
Step 300:   train 2.336  |  eval 2.340
Step 600:   train 2.282  |  eval 2.297
Step 900:   train 2.241  |  eval 2.279
Step 1050:  train 2.231  |  eval 2.275

A clean, healthy loss curve. Eval loss dropped steadily to 2.275, well below Week 3's 2.495. Train and eval tracked each other closely, indicating no classic overfitting. Mean token accuracy rose to 0.515.

Every number said the model improved.

The Regression

Then I ran the five test questions with greedy decoding.

The diabetes answer began correctly, then collapsed:

Eye yawning
Eye yawns
Eye years
Eye yolks
Eye yummy
Eye yogurt

A complete generation breakdown, under greedy decoding, which is supposed to be the stable option. The heart attack answer produced a runaway list that drifted from cardiac symptoms into sore throats and ear pain. Hypertension confidently recommended atenolol as first-line therapy, which is wrong: beta-blockers are not first-line for uncomplicated hypertension.

The model with the better loss number produced worse answers than Week 3's model.

Diagnosing It

Two things were happening, and separating them mattered.

First, the repetition guard backfired. I had set no_repeat_ngram_size=3, which forbids repeating any three-token sequence. Once the model generates a three-token phrase like "consult your doctor", it can never produce that exact phrase again in the same answer. The intent is to stop repetition loops. The effect was the opposite: when the model wanted to end a list by repeating a natural closing pattern, the rule forbade it, forcing a brand-new token every time. The only way to keep producing non-repeating tokens was to drift into nonsense: "Eye yummy, Eye yogurt." The setting meant to prevent loops was driving the degeneration.
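The mechanics of the three-token ban can be shown with a toy version. This is a simplified sketch that works on whole words instead of token ids, and `banned_next_tokens` is a hypothetical helper written for illustration, not the library's code.

```python
def banned_next_tokens(generated, n=3):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    # Look for earlier occurrences of that prefix and ban whatever followed it.
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

tokens = ["please", "consult", "your", "doctor", ".", "always", "consult", "your"]
blocked = banned_next_tokens(tokens, n=3)
print(blocked)
```

Here the natural continuation "doctor" is blocked outright, so the decoder has to pick something else, however unlikely. In a long list, every item that shares a two-token prefix with an earlier item runs into the same wall.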

Second, and more fundamental: the model had overfit to list generation. The combined dataset, especially the WikiDoc half, contained many list-formatted answers. Two epochs reinforced a pattern: when answering, produce a list and keep extending it. On questions with naturally bounded answers (a mechanism or short cause), the model stayed controlled. On questions inviting enumeration (drugs, symptoms), it started a list and could not stop, eventually confabulating list items: invented drugs like "artuzofloxacin", invented symptoms.

The loss curve never showed this because loss measures next-token prediction accuracy on the eval set. A model can be better at predicting the next token while getting worse at producing a coherent, bounded, truthful answer.
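One way to read the two loss values is as perplexity. The sketch below assumes the reported eval loss is the mean cross-entropy per token in nats, which is the usual convention but is not stated explicitly in the training logs.

```python
import math

week3_eval_loss = 2.495
week4_eval_loss = 2.275

# Perplexity is the exponential of the mean per-token cross-entropy.
week3_ppl = math.exp(week3_eval_loss)
week4_ppl = math.exp(week4_eval_loss)

print(f"Week 3 perplexity: {week3_ppl:.2f}")
print(f"Week 4 perplexity: {week4_ppl:.2f}")
```

Both figures are averages over single next-token predictions scored against a reference answer. Neither involves the model generating freely for 256 tokens, which is exactly where the Week 4 failure appeared.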

The Fix

Three changes, applied together.

Rebalanced the data. Dropped WikiDoc from 4,000 to 1,500 and raised ChatDoctor to 8,500, roughly 85% narrative prose, 15% encyclopedic. ChatDoctor's conversational answers train the model toward bounded, flowing responses rather than open-ended lists. This attacked the root behaviour.
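A fixed-ratio mix like this can be sketched with the standard library. The rows below are placeholders standing in for the cleaned datasets, and `mix_datasets` and the seed value are illustrative assumptions, not the project's code.

```python
import random

def mix_datasets(conversational, encyclopedic, n_conv, n_ency, seed=42):
    """Sample a fixed number of rows from each source and shuffle them together."""
    rng = random.Random(seed)  # local generator so the mix is repeatable
    mixed = rng.sample(conversational, n_conv) + rng.sample(encyclopedic, n_ency)
    rng.shuffle(mixed)
    return mixed

# Placeholder rows; the real inputs would be the cleaned ChatDoctor and WikiDoc sets.
chatdoctor_rows = [{"source": "chatdoctor", "id": i} for i in range(9_000)]
wikidoc_rows = [{"source": "wikidoc", "id": i} for i in range(5_000)]

mixed = mix_datasets(chatdoctor_rows, wikidoc_rows, n_conv=8_500, n_ency=1_500)
wikidoc_share = sum(row["source"] == "wikidoc" for row in mixed) / len(mixed)
print(f"{len(mixed)} rows, {wikidoc_share:.0%} WikiDoc")
```

Pinning the counts rather than a percentage keeps the mix exact, and the fixed seed means the same rows are drawn on every run.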

Expanded the LoRA target modules. Those module names need a short explanation. Each layer of the model has two parts that do different jobs. The attention layers (q_proj, k_proj, v_proj, o_proj) decide what to pay attention to: how tokens relate to each other, how "the patient" connects to "their symptoms" later in the same question. The feed-forward layers (up_proj, down_proj, gate_proj) are where factual knowledge tends to be stored and retrieved; research suggests they behave somewhat like a key-value memory, where a concept goes in and the associated facts come out.

Week 3 applied LoRA only to the attention layers, leaving the feed-forward layers frozen. That tuned how the model routes information but left the layers that hold the facts untouched. The confabulation, inventing drugs like "artuzofloxacin", was a factual recall failure: the model could not keep real drug names active while generating a list. So in Week 4, I added the feed-forward layers to the LoRA targets, letting fine-tuning adjust the part of the model where the facts live, not just the attention routing. (The claim that facts live in the feed-forward layers is a simplification; knowledge is distributed across the whole model. But as a reason to target those layers for a recall problem, it holds.)
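In configuration terms the change is a longer target list. The sketch below only builds the keyword arguments; the rank of 16 is carried over from Week 3 as an assumption, and lora_alpha and dropout are left out because the post does not state them.

```python
# Module names are the ones listed above; everything else is illustrative.
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
FEED_FORWARD_MODULES = ["gate_proj", "up_proj", "down_proj"]

week3_targets = list(ATTENTION_MODULES)                    # attention only
week4_targets = ATTENTION_MODULES + FEED_FORWARD_MODULES   # attention + feed-forward

# Keyword arguments in the shape peft.LoraConfig accepts.
lora_kwargs = {
    "r": 16,                       # assumed unchanged from Week 3
    "target_modules": week4_targets,
    "task_type": "CAUSAL_LM",
}
# With peft installed, this would become: LoraConfig(**lora_kwargs)
print(lora_kwargs["target_modules"])
```

Adding the three feed-forward projections also raises the number of trainable parameters, so the adapter is larger than an attention-only one at the same rank.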

Fixed generation. Removed no_repeat_ngram_size entirely. Set eos_token_id explicitly to <|eot_id|> so the model can actually stop. Used repetition_penalty=1.3 to discourage loops without a hard n-gram ban, and capped max_new_tokens at 256.

outputs = model.generate(
    input_ids=encoded_inputs["input_ids"],
    attention_mask=encoded_inputs["attention_mask"],
    max_new_tokens=256,
    do_sample=False,
    repetition_penalty=1.3,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    pad_token_id=tokenizer.pad_token_id,
)
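The difference between the soft penalty and the hard ban is easier to see on toy numbers. This is a plain-Python sketch of the usual rule (divide positive scores, multiply negative ones) applied to made-up logits; `apply_repetition_penalty` is a hypothetical helper, not the library function.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Dampen the scores of tokens that already appear in the output."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores are pushed further down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.6, 1.0, -1.0, 0.5]   # toy scores for a 4-token vocabulary
already_generated = [0, 2]       # token ids seen so far
adjusted = apply_repetition_penalty(logits, already_generated)
print(adjusted)
```

Token 0 is discouraged but still has the highest score, so the model can repeat it when repetition is clearly the right move. A hard n-gram ban would have removed that option entirely.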

One epoch on the rebalanced data.

The Result

The degeneration was gone. Hypertension named four real antihypertensive drug classes (ACE inhibitors, ARBs, beta-blockers, calcium channel blockers) and stopped. Malaria named real treatments (artemether-lumefantrine, chloroquine, mefloquine) and stopped. The diabetes and iron deficiency answers stayed accurate. The heart attack answer, which had failed in every previous run, finally produced seven correct cardiac warning signs and stopped.

And running the same question twice produced byte-for-byte identical output. Greedy decoding made the results reproducible, which is what makes the claims defensible. In Week 3, the same question could give a good answer one run and collapse the next. Now the model's behaviour is consistent and verifiable.

What I Actually Learned

Lower loss is not a better model. Eval loss measures next-token prediction. It does not measure factual accuracy, coherence, or whether the model knows when to stop. The Week 4 two-epoch model had the best loss and the worst generation. I would not have caught this if I had trusted the loss number and skipped manually testing the model's output.

Generation settings are not an afterthought. The same weights produced a total collapse or a clean answer depending on the decoding configuration. An n-gram ban meant to prevent repetition actively caused the degeneration. Half of the battle with a small model is how you decode it.

Small models have a ceiling. A 3B model fine-tuned on a free-tier T4 handles clinical QA well but struggles to enumerate without confabulating. Rebalancing the data and expanding LoRA targets pushed that ceiling up, but it is a real limit. In production, the answer would be a larger base model. Naming the constraint honestly makes it clearer which improvements are worth pursuing.

Reproducibility is a feature you build in. Greedy decoding, a fixed seed, a pinned data sample. Without these, "the model does X" is not a claim I can stand behind.

Where the Model Lives

nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned-v2

Adapter file: ~50MB, the pad token fix working as intended.

What's Next

Week 5 wraps the model in a FastAPI inference endpoint, containerises it with Docker, and deploys it to a public URL anyone can call. The generation settings worked out this week become the server's defaults.

Model: huggingface.co/nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned-v2
Dataset: huggingface.co/datasets/nicholas-ugbala-hf/medical-qa-narrative-10k
Repo: github.com/nicholas-ugbala-dev/healthcare-llm-finetune

Fine-Tuning Llama 3.2 3B on Medical QA: Week 3 - The First Training Run

Nicholas (Kosisochukwu) Ugbala — Thu, 04 Jun 2026 09:48:18 +0000

What Happened This Week

Week 2 ended with a cleaned dataset formatted and pushed to Hugging Face Hub. Week 3 is where the actual fine-tuning happened: configuring LoRA adapters, running the training loop, and comparing the fine-tuned model's outputs against the Week 1 baseline.

It did not go smoothly. This post documents what broke, why it broke, and what the results actually showed.

The Stack for This Week

The fine-tuning stack builds on top of what was installed in Week 1:

peft – implements LoRA. It adds small trainable adapter matrices to specific layers of the frozen base model.
trl – provides SFTTrainer, the supervised fine-tuning training loop. It handles batching, gradient accumulation, evaluation, and checkpointing.
bitsandbytes – still handling 4-bit quantization so the 3B model fits in GPU memory.

What LoRA (Low-Rank Adaptation) is Actually Doing

Before getting into the training run, it is worth explaining what LoRA does because it is the core technique of this project.

The base model has 3.2 billion parameters. Updating all of them during fine-tuning would require roughly 24GB of VRAM and hours of compute. That is not feasible on a free GPU.

LoRA does not update the original weights at all. Instead, it adds two small trainable matrices alongside specific layers. Every attention layer in the model has weight matrices for query, key, value, and output projections. For each of these, LoRA adds:

Matrix A: 16 x 3072 = 49,152 parameters
Matrix B: 3072 x 16 = 49,152 parameters

During the forward pass, the layer computes:

output = (W x input) + scale x (B x A x input)

W: the frozen base model weights.

scale: controls the strength of the adjustment (scale = lora_alpha / r).

W is frozen. Only A and B receive gradient updates. Across 28 transformer layers and 4 target modules per layer, this amounts to roughly 9 million trainable parameters out of 3.2 billion total, about 0.28% of the model. The original weights stay completely untouched.

The rank value of 16 controls the adapter's capacity. Higher rank means more expressive adapters but more parameters to train. 16 is a common starting point for a model of this size.
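The forward pass and the parameter count can both be checked with a small sketch. The toy matrices use a hidden size of 3 and rank 2 so the arithmetic is readable, and lora_alpha=4 is a made-up value. The projection shapes used for the count are an assumption taken from the public Llama 3.2 3B configuration, where grouped-query attention makes k_proj and v_proj narrower than q_proj and o_proj.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """output = W x + (lora_alpha / r) * B (A x)"""
    scale = lora_alpha / r
    base = matvec(W, x)                  # frozen path
    update = matvec(B, matvec(A, x))     # low-rank trainable path
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: hidden dimension 3, rank 2 (the real run uses 3072 and 16).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # frozen weights
A = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]                    # r x d, trainable
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]                  # d x r, zero at init
x = [1.0, 2.0, 3.0]

# With B at zero, the adapter contributes nothing and the base output is unchanged.
output_at_init = lora_forward(W, A, B, x, lora_alpha=4, r=2)

# Trainable parameter count: (in_features, out_features) per module, assumed
# from the public model config rather than measured in this project.
SHAPES = {"q_proj": (3072, 3072), "k_proj": (3072, 1024),
          "v_proj": (3072, 1024), "o_proj": (3072, 3072)}
RANK, LAYERS = 16, 28
trainable = LAYERS * sum(RANK * (d_in + d_out) for d_in, d_out in SHAPES.values())
print(output_at_init, trainable)
```

Because B starts at zero, a freshly wrapped model behaves exactly like the base model until training moves B away from zero. The count comes out just above 9 million only because k_proj and v_proj are narrower; four full 3072 x 3072 projections per layer would give about 11 million.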

What `prepare_model_for_kbit_training` Does

This is a function called before applying the LoRA adapters, and it is worth explaining precisely.

The model is loaded in 4-bit quantization, which means its weights are compressed and frozen. Before training, PyTorch needs to know how to propagate gradients backward through those frozen quantized layers to reach the LoRA adapters.

By default, PyTorch sees frozen layers and stops tracking gradient flow through them entirely. The error signal from the loss never reaches the LoRA matrices. No gradient, no learning.

prepare_model_for_kbit_training tells PyTorch to keep passing gradients through the frozen layers even though those layers are not being updated. The frozen layers are passthrough nodes in the gradient computation graph. The LoRA adapters at the end of the chain receive the gradient and update accordingly.

Without this call, and with gradient checkpointing enabled as it is in this run, training can complete without error while the LoRA weights receive no gradient. The model would then be identical to the base model after training.

Encountering a Hardware Challenge

The initial plan was to train on Google Colab's free T4 GPU with fp16 mixed precision enabled. This failed.

The error was:

NotImplementedError: "_amp_foreach_non_finite_check_and_unscale_cuda"
not implemented for 'BFloat16'

The fp16 gradient scaler encountered BFloat16 tensors created internally by prepare_model_for_kbit_training during the backward pass. The T4 supports float16 natively, but the interaction between the quantized base model layers and the LoRA adapter initialization in newer PEFT versions produces BFloat16 intermediate tensors that the fp16 scaler cannot process.

The working solution was to disable mixed precision entirely with fp16=False, bf16=False and move to Kaggle's T4 GPU, which offers a 30-hour weekly GPU quota instead of Colab's 2-4-hour session cutoff.

The tradeoff: float32 gradient computation can be roughly 2 to 3x slower than fp16 on a T4. In this run the gap was smaller: a training run I estimated at 45 minutes with fp16 took 1 hour 13 minutes without it, about 1.6x longer.

This is a library version compatibility issue specific to this combination of TRL 1.5.1, PEFT, and T4 hardware. On an A100 with bf16=True, or a V100 with fp16=True, training should run without these conflicts.

Training Configuration

training_args = SFTConfig(
    output_dir="/kaggle/working/checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    gradient_checkpointing=True,
    logging_steps=25,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    report_to="none",
)

A few decisions worth explaining:
gradient_accumulation_steps=4: with batch size 4, each weight update uses the accumulated gradient from 16 samples (4 x 4). This simulates a larger batch without the VRAM cost of loading 16 samples simultaneously.

gradient_checkpointing=True: instead of storing all intermediate layer activations in VRAM during the forward pass, PyTorch recomputes them during the backward pass when needed. This gives significant VRAM savings at the cost of roughly 20% slower training. On a 15.6GB T4 running float32, this was necessary to avoid OOM (OutOfMemoryError: CUDA out of memory).

load_best_model_at_end=True: after training completes, load the checkpoint with the lowest eval loss rather than the final checkpoint. This protects against slight overfitting in the last few steps.

learning_rate=2e-4: the standard starting point for LoRA fine-tuning. It controls how aggressively the adapter weights are updated per step.

Training Results

Step 100:  train loss 2.570  |  eval loss 2.558
Step 200:  train loss 2.525  |  eval loss 2.511
Step 300:  train loss 2.482  |  eval loss 2.495

Final: 309 steps, 1 epoch, 4,937 samples, 1h 13m

Training loss decreased steadily across all 309 steps. Eval loss tracked it closely with a gap of only 0.013 at the final step. A large and growing gap between training and eval loss would indicate overfitting. The consistent small gap indicates the model is generalizing to unseen examples, not just memorizing the training data.

The loss was still declining at step 309, which means training stopped before full convergence. This is expected for one epoch. Week 4 will run two epochs on the full 9,000-sample dataset.
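The step count follows directly from the batch settings in the training configuration. The sketch below assumes a single GPU and that the final partial batch is rounded up; the exact rounding can differ between trainer versions.

```python
import math

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_train_samples = 4_937
num_epochs = 1

# Samples that contribute to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Forward/backward passes per epoch, then optimizer updates per epoch.
micro_batches_per_epoch = math.ceil(num_train_samples / per_device_batch_size)
updates_per_epoch = math.ceil(micro_batches_per_epoch / gradient_accumulation_steps)
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size, total_updates)
```

With these settings the arithmetic gives 16 samples per update and 309 updates, matching the logged step count.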

The System Prompt Is Part of the Model

This is also worth noting: the system prompt used during inference must match exactly what was used during training.

Every training sample was formatted with:

If you are a doctor, please answer the medical questions
based on the patient's description.

The model spent 309 steps learning to produce clinical prose in response to that specific framing. Using a different system prompt at inference time places the model in a context it was never fine-tuned on and degrades output quality. This is a subtle but important constraint that needs mentioning.
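One way to enforce this is to keep the prompt in a single constant that both the training formatter and the inference code import. The sketch below is illustrative; `build_messages` is a hypothetical helper and the sample questions are placeholders.

```python
SYSTEM_PROMPT = (
    "If you are a doctor, please answer the medical questions "
    "based on the patient's description."
)

def build_messages(question, answer=None):
    """Build chat messages for training (with answer) or inference (without)."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    if answer is not None:
        # Training samples carry the reference answer as the assistant turn.
        messages.append({"role": "assistant", "content": answer})
    return messages

train_messages = build_messages("Placeholder patient question", "Placeholder doctor answer")
infer_messages = build_messages("What are the early symptoms of type 2 diabetes?")
print(train_messages[0] == infer_messages[0])
```

Since both paths read the same constant, the system turn cannot drift between training and inference without a visible code change.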

Before and After

These are the same five questions asked of the base model in Week 1 and the fine-tuned model after training.

The critical comparison — type 2 diabetes:

Base model:

"When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst."

Fine-tuned model:

"Increased thirst and urination: High blood sugar levels can cause the body to produce more urine, leading to dehydration and increased thirst."

The hallucination is gone. The base model fabricated a causal link between insulin and water retention. The fine-tuned model correctly attributes increased thirst to high blood sugar levels causing osmotic diuresis. That is the exact mechanism this project was designed to fix.

What else improved:

Filler openers are completely gone across all five responses. The base model opened every answer with "As a medical assistant, I'd be happy to help you..." None of the fine-tuned responses contain this pattern. Responses go directly into clinical content.

The hypertension response improved significantly. The fine-tuned model stratifies treatment by severity and names specific drug classes with examples. The base model was more generic.

What still needs work:

The diabetes response lists "slow speech" as a symptom of type 2 diabetes. It is not. Slow speech is associated with hypoglycemia or stroke, not early-stage type 2. One epoch over 4,937 samples did not eliminate all hallucinations.

The heart attack response includes "coughing up blood" and "pain or discomfort in the face, especially the cheeks or forehead" as warning signs. Neither is a classic cardiac symptom. One response ends with "Hope this answers your question. Let me know if I can assist you further" — a filler pattern that survived the cleaning pipeline.

These residual errors are the specific targets for Week 4.

Where the Model Lives

The fine-tuned adapter weights are published on Hugging Face Hub:

nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned

The repo contains the LoRA adapter weights (adapter_model.safetensors), the adapter configuration (adapter_config.json), and the modified tokenizer with the custom pad token. The base model stays at Meta's repo. Loading the fine-tuned model requires both.

One note on file size: adapter_model.safetensors is 3.19GB rather than the expected 30 to 60MB for LoRA adapters. This is because PEFT saved the full embedding layer alongside the adapters, since the tokenizer vocabulary was extended by one token during training. The functionality is identical but the file size reflects this tradeoff.

What's Next

Week 3 confirmed the pipeline works and produced measurable improvement. The residual errors are clearly identified.

Week 4 targets: full 9,000 sample dataset, two epochs, tighter data cleaning to remove surviving filler patterns, and a second dataset to improve factual grounding.

Model: huggingface.co/nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned
Repo: github.com/nicholas-ugbala-dev/healthcare-llm-finetune
Dataset: huggingface.co/datasets/nicholas-ugbala-hf/chatdoctor-cleaned-10k

Fine-Tuning Llama 3.2 3B on Medical QA: Week 2- Data Preparation

Nicholas (Kosisochukwu) Ugbala — Wed, 27 May 2026 08:11:36 +0000

What Happened This Week

Week 1 established the baseline. This week is where the actual engineering begins.

Before any fine-tuning can happen, the training data has to be in the exact format the model expects. That sounds simple. It is not. This week involved loading a 112K-row medical dataset, discovering it was the wrong dataset for the goal, switching to a different dataset, building a cleaning pipeline, and formatting everything into the Llama 3.2 chat template. Every step had a decision worth explaining.

The Wrong Dataset

The initial plan was to use lavita/medical-qa-datasets with the medical_meadow_medqa subset. Loading it and inspecting the samples revealed a problem I initially ignored.
The outputs looked like this:

OUTPUT: D: Trimethoprim-sulfamethoxazole
OUTPUT: A: The most important risk factors are hypertension and diabetes
OUTPUT: E: Pneumovax

These are answer selections, not clinical explanations. The dataset is USMLE multiple-choice questions. Training on this would produce a model that selects answer letters from five options, which is not the goal. The goal is a model that answers clinical questions in clear, factual prose.

The dataset was correct in provenance (board-exam quality) but incorrect in shape. Switching the dataset was the right call.

The Right Dataset

ChatDoctor HealthCareMagic 100K (lavita/ChatDoctor-HealthCareMagic-100K) is 112,165 real patient questions with doctor responses in prose format. Output looks like this:

"Fibrotic scarring in the right apical region of the lung may be due to past infection like tuberculosis. Fibrosis is a healed stage and generally does not require treatment. You may need to follow up with a chest physician for monitoring."

That is the output style the fine-tuned model should produce. Conversational, factual, direct.

The tradeoff: this is real forum data, not curated clinical text. Quality varies. Some responses are excellent clinical reasoning. Others are vague. The engineering problem for Week 2 was building a cleaning pipeline that keeps the signal and removes the noise.

The Cleaning Pipeline

Loading the raw dataset and inspecting samples revealed four specific problems:

Platform filler in outputs. Every response opens with noise that the model will learn and replicate:

"Hello, welcome to Chat Doctor..."
"Thanks for using Chat Doctor..."
"Hi Dear, Welcome to Chat Doctor..."
"and I hope I can help you today..."
"Thank you for posting your query..."

If these survive into the training data, the fine-tuned model will learn to open every response the same way. That would be worse than the base model's filler.

Trailing sign-offs. Outputs ended with:

"...Best wishes, Chat Doctor."
"...I hope this helps."
"...Take care."

This is the same problem as the filler openers. These are social conventions from a forum platform, not clinical reasoning patterns worth learning.

Platform name artifacts in inputs. Some patient inputs contained the platform name mid-sentence, likely leaked in through copy-paste errors during data collection. Training on these teaches the model that "ChatDoctor" is a meaningful clinical term.

Output quality variance. Some outputs, at fewer than 30 words, were too short to contain useful clinical content. Some sequences were too long for the T4's VRAM budget when tokenized.

The cleaning function strips filler from both ends of every output using regex patterns, and the filter function rejects rows that fail the quality thresholds. A second pass removed any samples where platform artifacts survived cleaning.

import re

def clean_output(text):
    # Strip forum greetings from the start and sign-offs from the end of a response.
    filler_starts = [
        r'^Hello[\w\s,]*Welcome to Chat\s?Doctor[.\s]*',
        r'^and I hope I can help you today\.?[\s]*',
        r'^Thank you for (posting|consulting|writing|using)[\w\s,]*[.\s]+',
        r'^Hello[\s,]+',
        r'^Hi[\s,]+',
        r'^Dear[\w\s,]+,[\s]*',
    ]
    for pattern in filler_starts:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE).strip()

    filler_ends = [
        r'[,.]?\s*Best wishes,?\s*Chat\s?Doctor\.?$',
        r'[,.]?\s*I hope (this|it) helps?\.?$',
        r'[,.]?\s*Take care\.?$',
    ]
    for pattern in filler_ends:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE).strip()

    return text

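def is_clean(sample):
    # Reject rows whose input still mentions the platform, or whose length
    # (counted in words) falls outside the quality thresholds.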
    if re.search(r'chatdoctor', sample['input'], re.IGNORECASE):
        return False
    if len(sample['output'].split()) < 30:
        return False
    if len(sample['input'].split()) + len(sample['output'].split()) > 600:
        return False
    return True

Result: 112,165 rows cleaned down to 45,205. About 60% of the data was removed. That can look like throwing away data that might have helped the model. The bet is the opposite: a model trained on 45K clean samples should outperform one trained on 112K noisy ones.

10,000 rows were then sampled randomly with seed=42 for reproducibility.

Formatting Into the Llama Chat Template

The model was trained on a specific conversation format. Feeding it data in any other structure will produce a corrupted training signal because the model does not know which tokens are the user's questions and which are the assistant's answers.

Every sample was converted from this:

{
    "instruction": "If you are a doctor, please answer the medical questions...",
    "input": "I have been having sharp chest pain on my left side...",
    "output": "Sharp chest pain that worsens with deep breathing..."
}

Into this:

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
If you are a doctor, please answer the medical questions based on the patient's description.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
I have been having sharp chest pain on my left side for two days...
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Sharp chest pain that worsens with deep breathing is often pleuritic in nature...
<|eot_id|>

The instruction field is identical across 112K rows. It is a static task description, not a per-sample instruction. It belongs in the system prompt, which is a model's standing brief about its role. The patient's question goes in the user turn. The doctor's response goes in the assistant's turn.

add_generation_prompt is set to False during training, because the full assistant response is already in the data. The model is learning to produce that response from the sample, not being prompted to generate a new one.
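The conversion can be sketched by hand to make the layout explicit. This is a simplified stand-in for the tokenizer's own chat template, which remains the authoritative source and may add extra system text; `format_llama3_chat` is a hypothetical helper.

```python
def format_llama3_chat(system, user, assistant):
    """Render one training sample in the header / eot layout shown above."""
    def turn(role, content):
        # Each turn: role header, blank line, content, end-of-turn token.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content.strip()}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system)
        + turn("user", user)
        + turn("assistant", assistant)
    )

sample = {
    "instruction": "If you are a doctor, please answer the medical questions "
                   "based on the patient's description.",
    "input": "I have been having sharp chest pain on my left side for two days.",
    "output": "Sharp chest pain that worsens with deep breathing is often pleuritic in nature.",
}
text = format_llama3_chat(sample["instruction"], sample["input"], sample["output"])
print(text)
```

The string ends with the assistant's <|eot_id|> rather than an open assistant header, which is the training-time counterpart of leaving the generation prompt off.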

Token Length Distribution

Before finalising the dataset, the token length of every formatted sample was measured:

Shortest:    78 tokens
Longest:     794 tokens
Average:     261 tokens
Over 512:    110 samples (1.1%)
Over 1024:   0 samples

max_seq_length = 512 was chosen for training. Only 1.1% of samples exceed it, so the loss from truncation is negligible. Using 512 instead of 1024 means less VRAM per sequence, faster training, and larger effective batch sizes on the T4.
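A summary like the one above can be produced from a list of per-sample token counts. The sketch below uses made-up lengths; in the real pipeline the counts would come from running the tokenizer over every formatted sample, and `summarize_lengths` is a hypothetical helper.

```python
def summarize_lengths(token_counts, limit=512):
    """Summarise token lengths and the share of samples above `limit`."""
    over = sum(1 for n in token_counts if n > limit)
    return {
        "shortest": min(token_counts),
        "longest": max(token_counts),
        "average": sum(token_counts) / len(token_counts),
        "over_limit": over,
        "over_limit_pct": 100 * over / len(token_counts),
    }

# Made-up lengths for illustration only.
lengths = [78, 150, 261, 300, 480, 512, 530, 794]
stats = summarize_lengths(lengths)
print(stats)
```

Running the same summary with limit=1024 is a quick way to confirm that a larger window would not rescue any additional samples.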

Train and Eval Split

The 10K formatted samples were split 90/10:

split = formatted.train_test_split(test_size=0.1, seed=42)
# Train: 9,000 samples
# Eval:  1,000 samples

I considered an 80/10/10 three-way split but decided against it. In this fine-tuning setup, a separate test set adds little value, because no architectural decisions are being made from held-out results. The eval set monitors loss on unseen samples during training. The real qualitative test is the five baseline questions from Week 1, run through the fine-tuned model after training.

The cleaned dataset is published on Hugging Face Hub for full reproducibility.

Why ChatDoctor and Not Something Better

ChatDoctor is not the highest-quality medical dataset available. PubMedQA has better clinical provenance. Augmented MedQA with chain-of-thought reasoning would produce stronger results. A GPT-4 synthesised dataset from medical textbooks would be cleaner.

ChatDoctor was chosen for three specific reasons. First, the output format matches the goal: conversational prose responses to patient-described symptoms. PubMedQA produces yes/no research answers, not clinical explanations. MedQA is multiple choice. Neither matches the target output style. Second, it is publicly available, ungated, and immediately loadable without preprocessing overhead. Augmented chain-of-thought versions of MedQA do not exist as clean public datasets and would require GPT-4 generation to create, introducing a proprietary dependency. Third, the cleaning problem is real and representative: building a pipeline that filters 112K noisy forum rows to 45K usable samples is closer to production data engineering than loading a pre-sanitised benchmark. For a project demonstrating the full fine-tuning pipeline, that tradeoff is deliberate.

What's Next

Week 2 is done. The dataset is clean, formatted, split, and live on Hugging Face.

Week 3 is the first LoRA fine-tuning run: configuring PEFT, setting up SFTTrainer, running training on the T4, and comparing the fine-tuned model's outputs against the Week 1 baseline. That is where the project works or reveals what needs fixing.
Cleaned dataset
healthcare-llm-finetune Repo

Fine-Tuning Llama 3.2 3B on Medical QA: Week 1 Setup and Baseline Inference

Nicholas (Kosisochukwu) Ugbala — Tue, 19 May 2026 09:56:18 +0000

The Problem With General-Purpose LLMs in Healthcare

Ask a general-purpose LLM about the early symptoms of type 2 diabetes and it might tell you:

"When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst."

That is wrong. Increased thirst in diabetes is caused by high blood glucose pulling fluid from tissues, not insulin. The model arrived at that answer through superficial pattern matching: insulin, fluid retention, and thirst are highly correlated keywords in general web text, so the model linked them causally. In a general conversation, that kind of error is a minor annoyance. In a clinical context, it erodes patient trust and can lead to incorrect self-triage.
This is the core problem with using general-purpose LLMs in healthcare: they are medically coherent on the surface but clinically unreliable underneath. Fine-tuning on a curated medical dataset penalizes these superficial token correlations and forces the model to adhere to verified clinical pathways.
That is what this project is about.

What I Am Building and Why

I am fine-tuning Llama 3.2 3B Instruct on a medical question-answering dataset, deploying it as a public inference API, and documenting the full pipeline from dataset choice to evaluation to deployment.
The goal is to learn the complete fine-tuning pipeline: data preparation, LoRA training, evaluation, and deployment. Healthcare is the domain for this project, but the pipeline is domain-agnostic. Every decision made here applies equally to legal QA, customer support, or any other vertical that needs a specialised language model.

By the end of the project, I will have:

A public GitHub repo with training code and reproducibility instructions
A fine-tuned model checkpoint on Hugging Face Hub
A live FastAPI inference endpoint
A technical write-up of every decision made and why

Model and Dataset Choice

Base Model: Llama 3.2 3B Instruct

The 3B model sits in the right spot for the constraints of this project. It is capable enough to produce meaningful answers on medical QA even before fine-tuning, so the baseline is not embarrassing and the demo is not hollow. It is small enough to train on a free-tier T4 GPU with QLoRA. And the Llama family has one of the richest fine-tuning ecosystems among open models.

Dataset: MedQuAD (via lavita/medical-qa-datasets on Hugging Face)

MedQuAD is a collection of medical question-answer pairs built from NIH websites, with content written and reviewed by medical professionals. The data is clean, authoritative, and NIH-sourced. Provenance matters in healthcare AI. Using forum-scraped data might produce a more conversational model, but you cannot defend the quality of the training data. With MedQuAD, you can.

An honest caveat: NIH material reflects US clinical guidelines. That is not a dealbreaker for a portfolio project demonstrating the fine-tuning pipeline, but a production system deployed globally would require localisation.

Infrastructure and Deployment

Training Compute: Google Colab (NVIDIA T4 GPU, 15.8GB VRAM). Colab is a cloud-hosted notebook environment that provisions GPU compute on demand. Think of it as a rented virtual machine with a GPU attached, accessible entirely through the browser. It is the right tool for a training run at this scale: no local GPU required, no cloud provisioning overhead.

Model Hosting: Hugging Face Hub. The GitHub of ML models. Model weights, versioned checkpoints, and model cards all live here publicly.
Inference API: FastAPI. A lightweight Python web framework for wrapping the model in an HTTP endpoint.
Containerisation: Docker. The FastAPI inference server is containerised, so the deployment is reproducible.
Fallback GPU: RunPod or Lambda Labs, if the free Colab tier hits limits on the larger training run in Week 4.

The Library Stack

torch (PyTorch)

The foundation on which everything else runs. PyTorch is a numerical computation engine. It handles tensors (multidimensional arrays), moves data between CPU and GPU, and executes the mathematical operations that make the model run. Every other library in this stack is built on top of it.

Why PyTorch and not TensorFlow? The open-source LLM ecosystem (Meta, Mistral, Qwen) ships in PyTorch.

transformers (Hugging Face)

The library that knows how to talk to a specific model. Every model architecture has its own structure: different layer names, tokenization logic, and chat templates. transformers abstracts all of that.

AutoModelForCausalLM: "Auto" means inspect the model config, determine the architecture, and load the correct class automatically. "ForCausalLM" means load it with a language-modeling head for text generation (predict the next token), not a classification head.

AutoTokenizer: converts raw text into token IDs the model understands, and converts the model's output IDs back into readable text.

apply_chat_template: Llama 3.2 was trained on a specific chat format that wraps each turn in <|start_header_id|>role<|end_header_id|> and <|eot_id|> tokens, with system, user, and assistant as the roles. Feed it raw text without that structure and the model does not know where the user's question ends and its answer should begin. apply_chat_template applies the correct format automatically.

bitsandbytes (via BitsAndBytesConfig)

This is what enables 4-bit quantization, the technique that lets a 3B parameter model fit on a 15.8GB T4 GPU with room left over for training.

A 3 billion parameter model is 3 billion numbers stored in memory. By default, each number takes 32 bits of space. That puts the raw model at approximately 12GB just to load, before a single token has been processed.
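The arithmetic behind that figure is easy to check. The snippet below is a rough sketch that counts weight storage only, assuming a rounded 3 billion parameters; it ignores activations, the KV cache, and framework overhead, and the function name is mine, not part of any library.

```python
# Weight-only memory estimate; real usage is higher once activations and overhead are added.
NUM_PARAMS = 3_000_000_000  # rounded parameter count, as used in the text


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

At 4 bits the weights alone come to about 1.5GB. The loaded model is somewhat larger in practice because not every tensor is quantized and the quantization constants take space too.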

peft

Parameter-Efficient Fine-Tuning. The library that implements LoRA.

Full fine-tuning, which updates all 3 billion parameters, would require at least roughly 24GB of VRAM for 32-bit weights and gradients alone, before optimizer state is counted. LoRA instead adds small trainable adapter matrices to specific layers and trains only those, typically a few percent of total parameters or less. The original weights stay frozen. This is what makes fine-tuning feasible on consumer hardware.
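To make the saving concrete, here is a small sketch of the parameter count for one adapted weight matrix. LoRA learns the update as the product of two thin matrices of rank r, so the trainable count is r × (d_out + d_in) instead of d_out × d_in. The dimensions and rank below are illustrative assumptions, not values read from the model or from my training config.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when a d_out x d_in update is factored as B (d_out x r) @ A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical dimensions for one square projection matrix (assumed for illustration).
d = 3072
rank = 16

full = d * d
lora = lora_trainable_params(d, d, rank)

print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {lora:,} params ({100 * lora / full:.2f}% of the matrix)")
```

The percentage for the whole model depends on the rank and on which layers get adapters, so treat this as an order-of-magnitude guide.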

trl (SFTTrainer)

Handles the supervised fine-tuning training loop. It manages dataset formatting, gradient accumulation, metric logging, and checkpoint saving. Without it, you write the training loop manually in raw PyTorch, which is correct but tedious.

datasets (Hugging Face)

Standardised interface for loading, filtering, and splitting datasets. Handles streaming large datasets without loading everything into RAM at once. One line to load MedQuAD, one line to split into train and eval.

accelerate (Hugging Face)

Device management. When you write device_map="auto", Accelerate determines how to distribute the model across available hardware. Mostly called under the hood by other libraries, so you rarely interact with it directly.

Loading the Model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
model.resize_token_embeddings(len(tokenizer))
tokenizer.padding_side = "right"

A 3 billion parameter model is essentially 3 billion numbers stored in memory. At 32 bits per number, that is a lot: imagine trying to fit a library into a small room.
Quantization is the process of shrinking each number to take up less space. Instead of 32 bits per number, we store it in 4 bits. The room is the same size, but each book is now an eighth of its original size, so everything fits.
The tradeoff is a small loss in precision; the numbers are slightly less exact. For most tasks, including medical QA, the drop in output quality is usually small.
The config specifically uses:
load_in_4bit=True: store weights in 4-bit instead of 32-bit. Now the 3B model loads in ~2GB.

bnb_4bit_quant_type="nf4": NormalFloat4. A 4-bit format designed specifically for neural network weights, which follow a roughly normal distribution. Better quality than naive 4-bit rounding.

bnb_4bit_compute_dtype=torch.float16: even though weights are stored in 4-bit, the actual computation (matrix multiplication) happens in float16. There is no 4-bit arithmetic: weights are dequantized on the fly for each operation while the stored copy stays in 4-bit. Float16 is the middle ground between accuracy and speed.
bnb_4bit_use_double_quant=True: quantize the quantization constants too. Saves a small additional amount of VRAM.

The Tokenizer

The model cannot read text. It only understands numbers.

Before any text enters the model, it is converted into a sequence of numbers called token IDs. After the model generates numbers as output, something converts them back into readable text. That is the tokenizer's job. It sits at the entry and exit point of the model.

The tokenizer does not split text into letters or whole words. It splits into tokens, which are chunks that could be a full word, part of a word, punctuation, or a special marker.

text = "What causes iron deficiency anemia?"

tokens = tokenizer.tokenize(text)
# ['What', 'Ġcauses', 'Ġiron', 'Ġdeficiency', 'Ġanemia', '?']

ids = tokenizer.encode(text, add_special_tokens=False)  # skip <|begin_of_text|> so only the text tokens are shown
# [3923, 11384, 11245, 32090, 42075, 30]

The model sees [3923, 11384, 11245, 32090, 42075, 30]. Never the actual words.

Longer medical words split into multiple tokens:

tokenizer.tokenize("hypertension")
# ['hyper', 'tension']   two tokens, one word

tokenizer.tokenize("acetaminophen")
# ['acet', 'amin', 'oph', 'en']   four tokens, one word

This is why token count and word count are different numbers, and why models have a context window measured in tokens, not words.

tokenizer.add_special_tokens({'pad_token': '<pad>'}): Llama's tokenizer does not set a padding token by default. A common workaround is to reuse the end-of-sequence token for padding, which means the model cannot distinguish between "end of response" and "this is padding." Adding a distinct pad token removes that ambiguity.

model.resize_token_embeddings(len(tokenizer)): the tokenizer's vocabulary just grew by one token. This line adds a matching row to the model's embedding matrix. Without it, the new pad token's ID would have no embedding to look up, and the model would fail on any padded input.
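The reason the resize is needed can be shown with a toy embedding table. This is a plain-Python illustration, not how transformers stores embeddings; the sizes and names are made up, and "resizing" here is just appending one row.

```python
# Toy embedding table: one row (vector) per token ID in the vocabulary.
vocab_size, hidden_size = 5, 3
embedding_table = [[0.0] * hidden_size for _ in range(vocab_size)]

# A newly added token receives the next unused ID, which is one past the last row.
new_pad_id = vocab_size


def lookup(table, token_id):
    """Return the embedding vector for a token ID."""
    return table[token_id]


try:
    lookup(embedding_table, new_pad_id)
    lookup_failed = False
except IndexError:
    # No row exists for the new ID until the table is resized.
    lookup_failed = True

# Stand-in for the resize step: add a row for the new token.
embedding_table.append([0.0] * hidden_size)
pad_vector = lookup(embedding_table, new_pad_id)

print("lookup failed before resize:", lookup_failed)
print("rows after resize:", len(embedding_table))
```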

Running Inference

Part 1: Encoding the Input

def generate_response(prompt):
    messages = [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": prompt}
    ]

    encoded_inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        padding=True
    ).to(model.device)

apply_chat_template converts the messages list into the exact string format Llama 3.2 was trained on:

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful medical assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What causes iron deficiency anemia?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

add_generation_prompt=True: adds the final <|start_header_id|>assistant<|end_header_id|> line. Without it, the model has no signal that the user has finished speaking and it is now expected to respond. Use True during inference. Use False during training, where the full assistant response is already in the data.
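To see what the flag changes, here is a simplified, hand-rolled version of the layout shown above. It is an illustration only: the function name is hypothetical, and the real template bundled with the tokenizer may add details this sketch leaves out, so real code should keep calling apply_chat_template.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified sketch of the Llama 3 chat layout (not the tokenizer's actual template)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, the content, then an end-of-turn marker.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What causes iron deficiency anemia?"},
]

prompt = build_llama3_prompt(messages)
print(prompt)
```

With add_generation_prompt=False the string ends at the user's <|eot_id|>, which is why the model would have no cue to start answering.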

return_tensors="pt": packages the token IDs as a PyTorch tensor. The model cannot consume a plain Python list or a NumPy array. "pt" means PyTorch.

return_dict=True: returns both input_ids and attention_mask as a dictionary instead of just the token IDs tensor.

The attention mask is a companion tensor to input_ids, same length, containing only 1s and 0s. 1 means real token, pay attention to it. 0 means padding, ignore it. Without it, the model treats padding as real content and the output degrades, especially during training.
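A minimal sketch of how padding and the mask line up, using made-up token IDs and a placeholder pad ID. The tokenizer does this for you when padding=True; the helper below is hypothetical and only shows the shape of the result.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


PAD_ID = 0  # placeholder value, not the real pad token ID
ids, mask = pad_batch([[11, 12, 13, 14], [21, 22]], PAD_ID)

print(ids)
print(mask)
```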

.to(model.device): moves the tensor from system RAM (CPU memory) to VRAM (GPU memory), where the model lives. Data and model must be in the same memory location. model.device resolves to "cuda:0" on the T4, and works on CPU machines too, unlike hardcoding "cuda".

Part 2: Generating and Decoding

    outputs = model.generate(
        input_ids=encoded_inputs["input_ids"],
        attention_mask=encoded_inputs["attention_mask"],
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id
    )

    input_length = encoded_inputs["input_ids"].shape[-1]
    response = outputs[0][input_length:]
    return tokenizer.decode(response, skip_special_tokens=True)

max_new_tokens=1000: hard ceiling on tokens generated. This counts only new tokens, not the input. Generation stops earlier if the model emits an end-of-sequence token; without an explicit ceiling, the length falls back to the generation config defaults.

do_sample=True: sample from the probability distribution instead of always picking the highest-scoring token. Greedy decoding (do_sample=False) is deterministic but tends toward repetitive, robotic text. For a baseline that should read as coherent prose, sampling is the better fit.

temperature=0.7: controls how sharp or flat the probability distribution is before sampling. Think of it as a confidence dial. At 0.1, the model almost always picks the most predictable next word. At 1.5, it picks surprising ones far more often. 0.7 sits between them: responses stay close to the model's most likely wording, but not so rigid they read like a dictionary entry.

top_p=0.9: nucleus sampling. After temperature reshapes the distribution, top_p cuts off the long tail. It ranks all tokens by probability, sums them from highest to lowest, and stops at 90%. Only tokens within that 90% are considered. The unlikely, potentially nonsensical tokens at the bottom get discarded. Temperature and top_p work together: one reshapes, the other prunes.
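Here is a small plain-Python sketch of the two steps working together on a toy four-token vocabulary. It follows the description above rather than the exact transformers implementation, and the logits are made up for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling followed by nucleus filtering."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the survivors; random.choices renormalises the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for a four-token vocabulary
random.seed(0)
samples = [sample_top_p(toy_logits) for _ in range(1000)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

With these toy values the two least likely tokens fall outside the 90% nucleus, so they are never sampled, while the top two still appear in proportion to their probabilities.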

pad_token_id=tokenizer.pad_token_id: explicitly tells the generate function which token represents padding. Removes any remaining ambiguity.

outputs[0][input_length:]: model.generate returns the full sequence, input tokens plus generated tokens concatenated. This slice discards the input and keeps only what the model generated.

tokenizer.decode(response, skip_special_tokens=True): converts token IDs back to readable text and strips structural tokens like <|eot_id|> and <|end_of_text|> from the output.

Baseline Results

The model was run against five test questions before any fine-tuning. These outputs serve as the benchmark. Every evaluation after training will be compared directly against them.

Q: What are the early symptoms of type 2 diabetes?
A: assistant

As a medical assistant, I'd be happy to help you identify the early symptoms of type 2 diabetes.

Type 2 diabetes is often referred to as "silent diabetes" because it can be asymptomatic for a long time, especially in the early stages. However, some people may experience the following early symptoms:

1. **Increased thirst and hunger**: When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst and hunger.
2. **Frequent urination**: As your body tries to flush out excess glucose, you may need to urinate more frequently, especially at night.
3. **Fatigue**: High blood sugar levels can cause fatigue, which can be mistaken for other conditions.
4. **Blurred vision**: High blood sugar levels can cause the lens in your eye to swell, leading to blurred vision.
5. **Slow healing of cuts and wounds**: High blood sugar levels can impede the healing process, leading to slower recovery from cuts and wounds.
6. **Tingling or numbness in your hands and feet**: High blood sugar levels can damage nerve endings, causing numbness or tingling sensations in your hands and feet.
7. **Recurring skin, gum, or bladder infections**: High blood sugar levels can weaken your immune system, making you more susceptible to infections.
8. **Weight gain or loss**: Some people may experience weight gain or loss due to the changes in their metabolism.
9. **Mood changes**: High blood sugar levels can cause mood swings, anxiety, and depression.

Keep in mind that these symptoms can be subtle and may not always be present. If you're concerned about your risk of developing type 2 diabetes, it's essential to talk to your healthcare provider about your individual risk factors and get regular check-ups.

Remember, early detection and treatment can make a significant difference in managing type 2 diabetes. If you're experiencing any of these symptoms, don't hesitate to consult with your healthcare provider.
============================================================
Q: How is malaria diagnosed and treated?
A: assistant

As a medical assistant, I'd be happy to explain the diagnosis and treatment of malaria.

**Diagnosis:**

Malaria is typically diagnosed using a combination of clinical evaluation, laboratory tests, and rapid diagnostic tests (RDTs). Here are the steps involved in diagnosing malaria:

1. **Clinical evaluation:** A healthcare provider will ask about symptoms, medical history, and travel history to determine if the patient has malaria.
2. **Physical examination:** A physical examination will be performed to look for signs of malaria, such as fever, chills, and flu-like symptoms.
3. **Blood tests:** A blood sample will be taken to check for the presence of malaria parasites in the blood. This can be done using:
    * **Microscopy:** A microscope is used to examine the blood sample under a microscope to look for the parasites.
    * **Rapid diagnostic tests (RDTs):** RDTs are quick and easy-to-use tests that detect the presence of malaria antigens in the blood.
    * **Polymerase chain reaction (PCR):** PCR is a molecular test that detects the genetic material of the malaria parasite.
4. **Confirmatory testing:** If the initial tests are positive, further testing may be done to confirm the diagnosis.

**Treatment:**

Treatment for malaria typically involves antimalarial medications. The type and dosage of medication will depend on the type of malaria, the patient's age, weight, and medical history, as well as any underlying medical conditions.

**Common antimalarial medications:**

1. **Artemisinin-based combination therapies (ACTs):** These are the most commonly used medications for treating malaria. Examples include artesunate and mefloquine.
2. **Quinine:** This is an older medication that is still used to treat malaria, particularly in areas where ACTs are not available.
3. **Clindamycin:** This antibiotic is used to treat malaria in combination with other medications.

**Treatment guidelines:**

1. **Adults:** Treatment for malaria in adults typically involves a 3-day course of ACTs.
2. **Children:** Treatment for malaria in children typically involves a 3-day course of ACTs or a 7-day course of mefloquine.
3. **Pregnant women:** Treatment for malaria in pregnant women typically involves a 3-day course of ACTs.

**Prevention:**

Preventing malaria involves a combination of measures, including:

1. **Vaccination:** The RTS,S vaccine is available for people living in areas with high malaria transmission.
2. **Insecticide-treated bed nets:** Using bed nets treated with insecticides can help prevent mosquito bites.
3. **Personal protective equipment (PPE):** Wearing PPE, such as long-sleeved shirts and pants, can help prevent mosquito bites.
4. **Screened windows and doors:** Installing screened windows and doors can help prevent mosquito bites.
5. **Eliminating standing water:** Eliminating standing water around homes and communities can help prevent mosquito breeding.

It's essential to note that malaria diagnosis and treatment should only be done under the guidance of a qualified healthcare provider.
============================================================

What is working. Responses are medically coherent, well structured, and consistently include appropriate caveats like "consult a healthcare provider." The malaria answer is particularly strong. It correctly distinguishes between microscopy, RDTs, and PCR for diagnosis, and covers ACT treatment guidelines accurately.

What fine-tuning should improve. The model frequently opens with "As a medical assistant, I'd be happy to help you..." That is a filler pattern from instruction tuning, not genuine clinical reasoning. Fine-tuning on MedQuAD should shift responses toward direct, factual answers without the preamble.

One clear failure. The diabetes answer contains a factual error: it attributes increased thirst to insulin causing water retention. The correct mechanism is high blood glucose pulling fluid from tissues through osmosis. This is the hallucination described at the top of this article, superficial keyword correlation producing a plausible-sounding but clinically incorrect causal chain. It is documented here because it is exactly what fine-tuning on curated medical data is meant to fix. If the fine-tuned model gets this right, that is a meaningful result.

What's Next

Week 1 is done. The environment is confirmed, the model is loaded, and the baseline is saved to the repo.

Week 2 is data preparation: loading MedQuAD, inspecting the schema, formatting the dataset into the instruction template Llama expects, and setting up train and eval splits. That is where the actual fine-tuning pipeline begins.

The repo is public: https://github.com/nicholas-ugbala-dev/healthcare-llm-finetune.git

Introduction to Machine Learning for Everyone!

Nicholas (Kosisochukwu) Ugbala — Tue, 25 Mar 2025 08:14:53 +0000

Machine Learning: Supervised and Unsupervised Learning

As defined by Arthur Samuel, machine learning is a field of study that allows computers to learn without being explicitly programmed. This enables computers to make predictions and decisions based on data.

Supervised Learning

Supervised learning refers to algorithms that learn from labeled data, mapping input (X) to output (Y). This means the algorithm is provided with examples to learn from before making predictions. Supervised learning is categorized into two main types: regression and classification.

Regression

Regression is used when the goal is to predict a continuous number from infinitely many possible values.

Example:

House Price Prediction: Given features such as the size of a house, location, and number of rooms, a regression model can predict the price of the house.
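As a rough illustration of what "learning from labeled examples" means here, the sketch below fits a straight line to a handful of made-up house sizes and prices, then predicts the price of an unseen house. The data is invented for the example and uses a single feature to keep the arithmetic visible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Made-up labeled examples: house size (input X) and price (output Y).
sizes = [50, 80, 100, 120]
prices = [70_000, 100_000, 120_000, 140_000]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # price estimate for a house the model has not seen

print(f"price = {slope:.0f} * size + {intercept:.0f}")
print(f"predicted price for size 90: {predicted:.0f}")
```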

Classification

Classification, on the other hand, is used to predict categories, where the possible outputs are finite and limited.

Example:

Disease Diagnosis: Identifying whether a patient has a particular disease based on symptoms and test results (e.g., whether a breast tumor is benign or malignant).

Difference Between Regression and Classification

Regression	Classification
Predicts a continuous number	Predicts categories
Infinitely many possible outputs	Small, limited set of possible outputs
Example: Predicting house prices	Example: Classifying a tumor as benign or malignant

Unsupervised Learning

Unlike supervised learning, unsupervised learning works with data that has no labels. The algorithm must find patterns and structure in the data on its own. A key example of unsupervised learning is clustering.

Clustering

Clustering algorithms group data points into clusters based on similarities.

Examples:

Customer Segmentation: E-commerce companies use clustering to group customers based on purchasing behavior.
Social Media Analysis: Identifying communities of users with similar interests.
Genetic Research: Grouping genes with similar expression patterns for medical research.

Other Types of Unsupervised Learning

Anomaly Detection: Used to detect fraud or unusual activity, such as spotting suspicious transactions in financial data.
Dimensionality Reduction: Represents data with fewer features, making it easier to analyze while preserving its essential structure.

We will explore these types of supervised and unsupervised learning in detail in upcoming discussions. Machine learning is an expansive and captivating field with applications across nearly every industry. Stay tuned as we delve deeper into its concepts and real-world implementations!
