DEV Community

Fine-Tuning Llama 3.2 3B on Medical QA: Week 4 - When Lower Loss Meant a Worse Model

What Happened This Week

Week 3 produced a working fine-tuned model: one epoch, one dataset, a clear improvement over the base model. Week 4 was supposed to make it better with more data (a second dataset), two epochs, and a cleaner setup.

The eval loss dropped from 2.495 to 2.275. By that number alone, Week 4 was a success.

The model was worse.

This is the story of how a better loss number hid a serious regression, how I diagnosed it, and what it took to actually fix it. It is one of the most useful things I have learned in this project.

The Plan

Four changes over Week 3:

  1. Combine two datasets: ChatDoctor (conversational patient-doctor QA) and MedAlpaca WikiDoc (encyclopedic clinical reference), for both conversational style and factual grounding.
  2. Use Llama's built-in pad token instead of adding a custom one, which avoids an oversized adapter file.
  3. Train for two epochs on the full dataset instead of one.
  4. Switch evaluation to greedy decoding for reproducibility.

The Pad Token Fix

In Week 3, I added a custom pad token and resized the model's embedding layer. This had an unintended cost: PEFT saved the entire resized embedding layer alongside the LoRA adapters, producing a 3.19GB adapter file instead of the expected ~50MB.

Llama 3.2's tokenizer already ships with a reserved padding token, <|finetune_right_pad_id|> (token 128004), made for exactly this purpose. Using it instead of adding a new token:

...
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.padding_side = "right"
# No add_special_tokens, no resize_token_embeddings

No embedding resize means no embedding layer saved with the adapters. The Week 4 adapter came out at ~50MB.

What I learned: before adding a special token, check whether the model already has one. Llama 3.2 did.
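One way to run that check is to scan the tokenizer's vocabulary for a reserved padding token before adding anything. This is a minimal sketch, not the code used in the project: `find_pad_candidates` is a hypothetical helper, the substring match on "pad" is a rough heuristic, and the small `vocab` dict stands in for what `tokenizer.get_vocab()` returns on a real tokenizer.

```python
def find_pad_candidates(vocab: dict[str, int]) -> list[tuple[str, int]]:
    """Return (token, id) pairs whose name suggests a padding token, sorted by id."""
    hits = [(tok, idx) for tok, idx in vocab.items() if "pad" in tok.lower()]
    return sorted(hits, key=lambda pair: pair[1])


# Illustrative stand-in vocabulary. Only the pad token and its id (128004)
# come from the text above; the other entries are placeholders.
# With a real tokenizer, pass tokenizer.get_vocab() instead.
vocab = {
    "<|placeholder_special_a|>": 1,
    "<|placeholder_special_b|>": 2,
    "<|finetune_right_pad_id|>": 128004,
    "hello": 3,
}

candidates = find_pad_candidates(vocab)
print(candidates)
```

On a full vocabulary the heuristic will also match ordinary word pieces that contain "pad", so the result is a shortlist to inspect by hand rather than a definitive answer.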

Building the Combined Dataset

ChatDoctor alone produced a model that answered conversationally but sometimes lacked factual grounding. WikiDoc is reference-grade encyclopedic medical content. The idea was that combining them would give both conversational style and factual grounding.

The first combined set used 8,000 ChatDoctor rows and 4,000 WikiDoc rows, a 2:1 ratio. After cleaning and a 512-token length filter, this left 10,255 rows: 9,229 train, 1,026 eval.

The cleaning itself was an exercise in diminishing returns. ChatDoctor forum data carries platform filler ("Hello, welcome to Chat Doctor", "Hope this helps"), and the source has OCR-level corruption that breaks pattern matching ("HiT hanks" for "Hi. Thanks"). I built a two-pass regex cleaner plus a sentence-level trailing filler stripper that removes any closing sentence containing filler keywords. It caught most of the noise. A small fraction of the corruption resisted cleaning entirely, which I documented rather than chased.
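The shape of such a cleaner can be sketched as below. This is an illustration only: the patterns, the keyword list, and the function names are assumptions, not the project's actual cleaner, and it shows one leading-filler regex plus the trailing sentence stripper rather than the full two passes.

```python
import re

# Hypothetical patterns and keywords, chosen for illustration only.
LEADING_FILLER = re.compile(
    r"^\s*(hi|hello)\b[,.!]?\s*(welcome to chat\s*doctor[^.!?]*[.!?])?\s*",
    re.IGNORECASE,
)
TRAILING_KEYWORDS = ("hope this helps", "take care", "chat doctor")
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def strip_leading_filler(text: str) -> str:
    """Remove a greeting (and an optional platform welcome) from the start."""
    return LEADING_FILLER.sub("", text, count=1)


def strip_trailing_filler(text: str) -> str:
    """Drop whole closing sentences while they contain a filler keyword."""
    sentences = SENTENCE_SPLIT.split(text.strip())
    while sentences and any(k in sentences[-1].lower() for k in TRAILING_KEYWORDS):
        sentences.pop()
    return " ".join(sentences)


def clean_answer(text: str) -> str:
    return strip_trailing_filler(strip_leading_filler(text))


raw = (
    "Hello, welcome to Chat Doctor. Iron deficiency is usually treated "
    "with oral iron. Hope this helps. Take care."
)
print(clean_answer(raw))
```

The word boundary after the greeting matters: without it, the pattern would also eat the first two letters of an answer that starts with a word like "History". Corrupted text such as "HiT hanks" slips past patterns like these, which is the kind of residue described above.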

The Training Run That Looked Like Success

Two epochs, 1,154 steps, about four hours on a Kaggle T4.

Step 150:   train 2.499  |  eval 2.474
Step 300:   train 2.336  |  eval 2.340
Step 600:   train 2.282  |  eval 2.297
Step 900:   train 2.241  |  eval 2.279
Step 1050:  train 2.231  |  eval 2.275

A clean, healthy loss curve. Eval loss dropped steadily to 2.275, well below Week 3's 2.495. Train and eval tracked each other closely, indicating no classic overfitting. Mean token accuracy rose to 0.515.

Every number said the model improved.

The Regression

Then I ran the five test questions with greedy decoding.

The diabetes answer began correctly, then collapsed:

Eye yawning
Eye yawns
Eye years
Eye yolks
Eye yummy
Eye yogurt

This was a complete generation breakdown under greedy decoding, which is supposed to be the stable option. The heart attack answer produced a runaway list that drifted from cardiac symptoms into sore throats and ear pain. The hypertension answer confidently recommended atenolol as first-line therapy, which is wrong: beta-blockers are not first-line for uncomplicated hypertension.

The model with the better loss number produced worse answers than Week 3's model.

Diagnosing It

Two things were happening, and separating them mattered.

First, the n-gram ban backfired. I had set no_repeat_ngram_size=3, which forbids repeating any three-token sequence. Once the model generates a three-token phrase like "consult your doctor", it can never produce that exact phrase again in the same answer. The intent is to stop repetition loops. The effect was the opposite: when the model wanted to end a list by repeating a natural closing pattern, the rule forbade it, forcing a brand-new token every time. The only way to keep producing non-repeating tokens was to drift into nonsense: "Eye yummy, Eye yogurt." The setting meant to prevent loops was driving the degeneration.
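To make the mechanism concrete, here is a small sketch of the rule a three-gram ban enforces at each decoding step. It is a simplified illustration that works on whole words; the real setting operates on token ids inside the generation loop, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(generated: list[str], n: int = 3) -> set[str]:
    """Tokens that would complete an n-gram already present in `generated`.

    Assumes n >= 2. A hard ban removes these tokens from consideration,
    no matter how likely the model thinks they are.
    """
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])  # the last n-1 tokens
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


tokens = "please consult your doctor and then consult your".split()
print(banned_next_tokens(tokens, n=3))
```

After "consult your" appears a second time, "doctor" is the one continuation the decoder is not allowed to pick, so greedy decoding falls back to the next most likely token, whatever that happens to be.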

Second, and more fundamental: the model had overfit to list generation. The combined dataset, especially the WikiDoc portion, contained many list-formatted answers. Two epochs reinforced a pattern: when answering, produce a list and keep extending it. On questions with naturally bounded answers (a mechanism or a short cause), the model stayed controlled. On questions inviting enumeration (drugs, symptoms), it started a list and could not stop, eventually confabulating list items: invented drugs like "artuzofloxacin", invented symptoms.

The loss curve never showed this because loss measures next-token prediction accuracy on the eval set. A model can be better at predicting the next token while getting worse at producing a coherent, bounded, truthful answer.
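A short sketch of what that number is may help. It assumes the reported eval loss is the usual mean cross-entropy in nats, which is the common default but is not stated in the training logs; the function names are illustrative.

```python
import math


def mean_nll(reference_token_probs: list[float]) -> float:
    """Average negative log-likelihood the model assigns to the reference tokens."""
    total = sum(math.log(p) for p in reference_token_probs)
    return -total / len(reference_token_probs)


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


# Eval losses reported for Week 3 and the two-epoch Week 4 run.
week3_ppl = perplexity(2.495)
week4_ppl = perplexity(2.275)
print(f"Week 3 perplexity: {week3_ppl:.2f}")
print(f"Week 4 perplexity: {week4_ppl:.2f}")
```

Every probability that goes into this average is conditioned on the reference text that came before it. The metric never scores text the model generated on its own, so it cannot see a list that refuses to end.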

The Fix

Three changes, applied together.

Rebalanced the data. Dropped WikiDoc from 4,000 to 1,500 and raised ChatDoctor to 8,500, roughly 85% narrative prose, 15% encyclopedic. ChatDoctor's conversational answers train the model toward bounded, flowing responses rather than open-ended lists. This attacked the root behaviour.
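A rebalance like this can be sketched as a seeded sample from each source. The pool sizes, the seed, and the `mix_datasets` helper below are assumptions for illustration; only the 8,500 and 1,500 targets come from the text.

```python
import random


def mix_datasets(chat_rows, wiki_rows, n_chat, n_wiki, seed=42):
    """Sample a fixed number of rows from each source and shuffle them together."""
    rng = random.Random(seed)  # fixed seed so the sample is pinned across runs
    mixed = rng.sample(chat_rows, n_chat) + rng.sample(wiki_rows, n_wiki)
    rng.shuffle(mixed)
    return mixed


# Placeholder pools standing in for the two cleaned datasets.
chat_rows = [{"source": "chatdoctor", "id": i} for i in range(20_000)]
wiki_rows = [{"source": "wikidoc", "id": i} for i in range(10_000)]

mixed = mix_datasets(chat_rows, wiki_rows, n_chat=8_500, n_wiki=1_500)
chat_share = sum(row["source"] == "chatdoctor" for row in mixed) / len(mixed)
print(len(mixed), f"{chat_share:.0%}")
```

Using a private `random.Random(seed)` instead of the global generator keeps the sample stable even if other code draws random numbers in between.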

Expanded the LoRA target modules. Those module names need a short explanation. Each layer of the model has two parts that do different jobs. The attention layers (q_proj, k_proj, v_proj, o_proj) decide what to pay attention to: how tokens relate to each other, how "the patient" connects to "their symptoms" later in the same question. The feed-forward layers (up_proj, down_proj, gate_proj) are where factual knowledge tends to be stored and retrieved; research suggests they behave somewhat like a key-value memory, where a concept goes in and the associated facts come out.

Week 3 applied LoRA only to the attention layers, leaving the feed-forward layers frozen. That tuned how the model routes information but left the layers that hold the facts untouched. The confabulation, inventing drugs like "artuzofloxacin", was a factual recall failure: the model could not keep real drug names active while generating a list. So in Week 4, I added the feed-forward layers to the LoRA targets, letting fine-tuning adjust the part of the model where the facts live, not just the attention routing. (The claim that facts live in the feed-forward layers is a simplification; knowledge is distributed across the whole model. But as a reason to target those layers for a recall problem, it holds.)
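The difference between the two setups comes down to the list of target module names. The sketch below builds the keyword arguments one might pass to PEFT's `LoraConfig`; the rank, alpha, and dropout values are assumptions, since the post does not state them, and `lora_kwargs` is a hypothetical helper.

```python
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
FEED_FORWARD_MODULES = ["gate_proj", "up_proj", "down_proj"]


def lora_kwargs(include_feed_forward: bool) -> dict:
    """Keyword arguments intended for peft.LoraConfig(**kwargs)."""
    targets = list(ATTENTION_MODULES)
    if include_feed_forward:
        targets += FEED_FORWARD_MODULES
    return {
        "r": 16,               # assumed rank, not stated in the post
        "lora_alpha": 32,      # assumed scaling, not stated in the post
        "lora_dropout": 0.05,  # assumed dropout, not stated in the post
        "target_modules": targets,
        "task_type": "CAUSAL_LM",
    }


attention_only = lora_kwargs(include_feed_forward=False)
with_feed_forward = lora_kwargs(include_feed_forward=True)
print(with_feed_forward["target_modules"])

# With peft installed, the config would be built as:
#   from peft import LoraConfig
#   config = LoraConfig(**with_feed_forward)
```

Targeting more modules means more trainable parameters, so the adapter file grows somewhat compared with an attention-only setup.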

Fixed generation. Removed no_repeat_ngram_size entirely. Set eos_token_id explicitly to <|eot_id|> so the model can actually stop. Used repetition_penalty=1.3 to discourage loops without a hard n-gram ban, and capped max_new_tokens at 256.

outputs = model.generate(
    input_ids=encoded_inputs["input_ids"],
    attention_mask=encoded_inputs["attention_mask"],
    max_new_tokens=256,
    do_sample=False,
    repetition_penalty=1.3,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    pad_token_id=tokenizer.pad_token_id,
)

Then I retrained for one epoch on the rebalanced data.

The Result

The degeneration was gone. The hypertension answer named four real antihypertensive drug classes (ACE inhibitors, ARBs, beta-blockers, calcium channel blockers) and stopped. The malaria answer named real treatments (artemether-lumefantrine, chloroquine, mefloquine) and stopped. The diabetes and iron deficiency answers stayed accurate. The heart attack answer, which had failed in every previous run, finally produced seven correct cardiac warning signs and stopped.

And running the same question twice produced byte-for-byte identical output. Greedy decoding made the results reproducible, which is what makes the claims defensible. In Week 3, the same question could give a good answer one run and collapse the next. Now the model's behaviour is consistent and verifiable.

What I Actually Learned

Lower loss is not a better model. Eval loss measures next-token prediction. It does not measure factual accuracy, coherence, or whether the model knows when to stop. The Week 4 two-epoch model had the best loss and the worst generation. I would not have caught this if I had assumed that lower loss means a better model and skipped manually testing the model's output.

Generation settings are not an afterthought. The same weights produced a total collapse or a clean answer depending on the decoding configuration. An n-gram repetition ban meant to help actively caused the degeneration. Half of the battle with a small model is how you decode it.

Small models have a ceiling. A 3B model fine-tuned on consumer hardware handles clinical QA well but struggles to enumerate without confabulating. Rebalancing the data and expanding LoRA targets pushed that ceiling up, but it is a real limit. In production, the answer would be a larger base model. Naming the constraint honestly makes it clearer where further improvements can come from.

Reproducibility is a feature you build in. Greedy decoding, a fixed seed, a pinned data sample. Without these, "the model does X" is not a claim I can stand behind.

Where the Model Lives

nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned-v2

Adapter file: ~50MB, the pad token fix working as intended.

What's Next

Week 5 wraps the model in a FastAPI inference endpoint, containerises it with Docker, and deploys it to a public URL anyone can call. The generation settings worked out this week become the server's defaults.

Model: huggingface.co/nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned-v2
Dataset: huggingface.co/datasets/nicholas-ugbala-hf/medical-qa-narrative-10k
Repo: github.com/nicholas-ugbala-dev/healthcare-llm-finetune
