What Happened This Week
Week 2 ended with a cleaned dataset formatted and pushed to Hugging Face Hub. Week 3 is where the actual fine-tuning happened: configuring LoRA adapters, running the training loop, and comparing the fine-tuned model's outputs against the Week 1 baseline.
It did not go smoothly. This post documents what broke, why it broke, and what the results actually showed.
The Stack for This Week
The fine-tuning stack builds on top of what was installed in Week 1:
- peft – implements LoRA. It adds small trainable adapter matrices to specific layers of the frozen base model.
- trl – provides SFTTrainer, the supervised fine-tuning training loop. It handles batching, gradient accumulation, evaluation, and checkpointing.
- bitsandbytes – still handling 4-bit quantization so the 3B model fits in GPU memory.
What LoRA (Low-Rank Adaptation) is Actually Doing
Before getting into the training run, it is worth explaining what LoRA does because it is the core technique of this project.
The base model has 3.2 billion parameters. Updating all of them during fine-tuning would require roughly 24GB of VRAM and hours of compute. That is not feasible on a free GPU.
LoRA does not update the original weights at all. Instead, it adds two small trainable matrices alongside specific layers. Every attention layer in the model has weight matrices for query, key, value, and output projections. For each of these, LoRA adds:
Matrix A: 16 x 3072 = 49,152 parameters
Matrix B: 3072 x 16 = 49,152 parameters
During the forward pass, the layer computes:
W: frozen base model weights.
scale: controls the strength of the adjustment (scale = lora_alpha / r)
output = (W x input) + scale (B x A x input)
W is frozen. Only A and B receive gradient updates. Across 28 transformer layers and 4 target modules per layer, this amounts to 9 million trainable parameters out of 3.2 billion total. That is 0.28% of the model being trained. The original weights stay completely untouched.
The rank value 16 controls the adapter's capacity. Higher rank means more expressive adapters but more parameters to train. 16 is the standard starting point for a 3B model.
What prepare_model_for_kbit_training Does
This is a function called before applying the LoRA adapters, and it is worth explaining precisely.
The model is loaded in 4-bit quantization, which means its weights are compressed and frozen. Before training, PyTorch needs to know how to propagate gradients backward through those frozen quantized layers to reach the LoRA adapters.
By default, PyTorch sees frozen layers and stops tracking gradient flow through them entirely. The error signal from the loss never reaches the LoRA matrices. No gradient, no learning.
prepare_model_for_kbit_training tells PyTorch to keep passing gradients through the frozen layers even though those layers are not being updated. The frozen layers are passthrough nodes in the gradient computation graph. The LoRA adapters at the end of the chain receive the gradient and update accordingly.
Without this call, training completes without error, but the LoRA weights never change. The model would be identical to the base model after training.
Encountering Hardware Challenge
The initial plan was to train on Google Colab's free T4 GPU with fp16 mixed precision enabled. This failed.
The error was:
NotImplementedError: "_amp_foreach_non_finite_check_and_unscale_cuda"
not implemented for 'BFloat16'
The fp16 gradient scaler encountered BFloat16 tensors created internally by prepare_model_for_kbit_training during the backward pass. The T4 supports float16 natively, but the interaction between the quantized base model layers and the LoRA adapter initialization in newer PEFT versions produces BFloat16 intermediate tensors that the fp16 scaler cannot process.
The working solution was to disable mixed precision entirely with fp16=False, bf16=False and move to Kaggle's T4 GPU, which provided a 30-hour session limit per week instead of Colab's 2-4-hour cutoff.
The tradeoff: float32 gradient computation is roughly 2 to 3x slower than fp16 on a T4. The training run that would have taken 45 minutes with fp16 took 1 hour 13 minutes without it.
This is a library version compatibility issue specific to this combination of TRL 1.5.1, PEFT, and T4 hardware. On an A100 or V100, bf16=True or fp16=True respectively, would work without any of these conflicts.
Training Configuration
training_args = SFTConfig(
output_dir="/kaggle/working/checkpoints",
num_train_epochs=1,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=50,
learning_rate=2e-4,
fp16=False,
bf16=False,
gradient_checkpointing=True,
logging_steps=25,
eval_strategy="steps",
eval_steps=100,
save_strategy="steps",
save_steps=100,
load_best_model_at_end=True,
report_to="none",
)
A few decisions worth explaining:
gradient_accumulation_steps=4: with batch size 4, each weight update uses the accumulated gradient from 16 samples (4 x 4). This simulates a larger batch without the VRAM cost of loading 16 samples simultaneously.
gradient_checkpointing=True: instead of storing all intermediate layer activations in VRAM during the forward pass, PyTorch recomputes them during the backward pass when needed. Significant VRAM savings at the cost of roughly 20% slower training. On a 15.6GB T4 running float32, this was necessary to avoid OOM(OutOfMemoryError: CUDA out of memory).
load_best_model_at_end=True: after training completes, load the checkpoint with the lowest eval loss rather than the final checkpoint. This protects against slight overfitting in the last few steps.
learning_rate=2e-4: the standard starting point for LoRA fine-tuning. It controls how aggressively the adapter weights are updated per step.
Training Results
Step 100: train loss 2.570 | eval loss 2.558
Step 200: train loss 2.525 | eval loss 2.511
Step 300: train loss 2.482 | eval loss 2.495
Final: 309 steps, 1 epoch, 4,937 samples, 1h 13m
Training loss decreased steadily across all 309 steps. Eval loss tracked it closely with a gap of only 0.013 at the final step. A large and growing gap between training and eval loss would indicate overfitting. The consistent small gap indicates the model is generalizing to unseen examples, not just memorizing the training data.
The loss was still declining at step 309, which means training stopped before full convergence. This is expected for one epoch. Week 4 will run two epochs on the full 9,000-sample dataset.
The System Prompt Is Part of the Model
This is also worth noting: the system prompt used during inference must match exactly what was used during training.
Every training sample was formatted with:
If you are a doctor, please answer the medical questions
based on the patient's description.
The model spent 309 steps learning to produce clinical prose in response to that specific framing. Using a different system prompt at inference time places the model in a context it was never fine-tuned on and degrades output quality. This is a subtle but important constraint that needs mentioning.
Before and After
These are the same five questions asked of the base model in Week 1 and the fine-tuned model after training.
The critical comparison — type 2 diabetes:
Base model:
"When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst."
Fine-tuned model:
"Increased thirst and urination: High blood sugar levels can cause the body to produce more urine, leading to dehydration and increased thirst."
The hallucination is gone. The base model fabricated a causal link between insulin and water retention. The fine-tuned model correctly attributes increased thirst to high blood sugar levels causing osmotic diuresis. That is the exact mechanism this project was designed to fix.
What else improved:
Filler openers are completely gone across all five responses. The base model opened every answer with "As a medical assistant, I'd be happy to help you..." None of the fine-tuned responses contain this pattern. Responses go directly into clinical content.
The hypertension response improved significantly. The fine-tuned model stratifies treatment by severity and names specific drug classes with examples. The base model was more generic.
What still needs work:
The diabetes response lists "slow speech" as a symptom of type 2 diabetes. It is not. Slow speech is associated with hypoglycemia or stroke, not early-stage type 2. One epoch over 4,937 samples did not eliminate all hallucinations.
The heart attack response includes "coughing up blood" and "pain or discomfort in the face, especially the cheeks or forehead" as warning signs. Neither is a classic cardiac symptom. One response ends with "Hope this answers your question. Let me know if I can assist you further" — a filler pattern that survived the cleaning pipeline.
These residual errors are the specific targets for Week 4.
Where the Model Lives
The fine-tuned adapter weights are published on Hugging Face Hub:
nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned
The repo contains the LoRA adapter weights (adapter_model.safetensors), the adapter configuration (adapter_config.json), and the modified tokenizer with the custom pad token. The base model stays at Meta's repo. Loading the fine-tuned model requires both.
One note on file size: adapter_model.safetensors is 3.19GB rather than the expected 30 to 60MB for LoRA adapters. This is because PEFT saved the full embedding layer alongside the adapters, since the tokenizer vocabulary was extended by one token during training. The functionality is identical but the file size reflects this tradeoff.
What's Next
Week 3 confirmed the pipeline works and produced measurable improvement. The residual errors are clearly identified.
Week 4 targets: full 9,000 sample dataset, two epochs, tighter data cleaning to remove surviving filler patterns, and a second dataset to improve factual grounding.
Model: huggingface.co/nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned
Repo: github.com/nicholas-ugbala-dev/healthcare-llm-finetune
Dataset: huggingface.co/datasets/nicholas-ugbala-hf/chatdoctor-cleaned-10k
Top comments (0)