Fine-Tuning Llama 3.2 3B on Medical QA: Week 1 Setup and Baseline Inference

Nicholas (Kosisochukwu) Ugbala — Tue, 19 May 2026 09:56:18 +0000

The Problem With General-Purpose LLMs in Healthcare

Ask a general-purpose LLM about the early symptoms of type 2 diabetes and it might tell you:

"When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst."

That is wrong. Increased thirst in diabetes is caused by high blood glucose pulling fluid from tissues, not insulin. The model arrived at that answer through superficial pattern matching: insulin, fluid retention, and thirst are highly correlated keywords in general web text, so the model linked them causally. In a general conversation, that kind of error is a minor annoyance. In a clinical context, it erodes patient trust and can lead to incorrect self-triage.
This is the core problem with using general-purpose LLMs in healthcare: their answers read as medically coherent on the surface but can be clinically unreliable underneath. Fine-tuning on a curated medical dataset is meant to weaken these superficial token correlations and pull the model toward verified clinical explanations.
That is what this project is about.

What I Am Building and Why

I am fine-tuning Llama 3.2 3B Instruct on a medical question-answering dataset, deploying it as a public inference API, and documenting the full pipeline from dataset choice to evaluation to deployment.
The goal is to learn the complete fine-tuning pipeline: data preparation, LoRA training, evaluation, and deployment. Healthcare is the domain for this project, but the pipeline is domain-agnostic. Every decision made here applies equally to legal QA, customer support, or any other vertical that needs a specialised language model.

By the end of the project, I will have:

A public GitHub repo with training code and reproducibility instructions
A fine-tuned model checkpoint on Hugging Face Hub
A live FastAPI inference endpoint
A technical write-up of every decision made and why

Model and Dataset Choice

Base Model: Llama 3.2 3B Instruct

The 3B model sits in the right spot for the constraints of this project. It is capable enough to produce meaningful answers on medical QA even before fine-tuning, so the baseline is not embarrassing and the demo is not hollow. It is small enough to train on a free-tier T4 GPU with QLoRA. And the Llama family has one of the richest fine-tuning ecosystems among open models.

Dataset: MedQuAD (via lavita/medical-qa-datasets on Hugging Face)

MedQuAD (Medical Question Answering Dataset) is a collection of question-answer pairs built from National Institutes of Health (NIH) websites. The answers come from pages written and reviewed by medical professionals, so the data is clean, authoritative, and traceable to its source. Provenance matters in healthcare AI. Forum-scraped data might produce a more conversational model, but you cannot defend the quality of the training data. With MedQuAD, you can.

An honest caveat: NIH content reflects US clinical guidance. That is not a dealbreaker for a portfolio project demonstrating the fine-tuning pipeline, but a production system deployed globally would require localisation.

Infrastructure and Deployment

Training Compute: Google Colab (NVIDIA T4 GPU, 15.8GB VRAM). Colab is a cloud-hosted notebook environment that provisions GPU compute on demand. Think of it as a rented virtual machine with a GPU attached, accessible entirely through the browser. It is the right tool for a training run at this scale: no local GPU required, no cloud provisioning overhead.

Model Hosting: Hugging Face Hub. The GitHub of ML models. Model weights, versioned checkpoints, and model cards all live here publicly.
Inference API: FastAPI. A lightweight Python web framework for wrapping the model in an HTTP endpoint.
Containerisation: Docker. The FastAPI inference server is containerised, so the deployment is reproducible.
Fallback GPU: RunPod or Lambda Labs, if the free Colab tier hits limits on the larger training run in Week 4.

The Library Stack

torch (PyTorch)

The foundation on which everything else runs. PyTorch is a numerical computation engine. It handles tensors (multidimensional arrays), moves data between CPU and GPU, and executes the mathematical operations that make the model run. Every other library in this stack is built on top of it.

Why PyTorch and not TensorFlow? Most of the open-weight LLM ecosystem, including models from Meta, Mistral, and Qwen, ships in PyTorch.

transformers (Hugging Face)

The library that knows how to talk to a specific model. Every model architecture has its own structure: different layer names, tokenization logic, and chat templates. transformers abstracts all of that.

AutoModelForCausalLM: "Auto" means inspect the model config, determine the architecture, and load the correct class automatically. "ForCausalLM" means load it in text-generation mode (predict the next token), not classification mode.

AutoTokenizer: converts raw text into token IDs the model understands, and converts the model's output IDs back into readable text.

apply_chat_template: Llama 3.2 was trained on a specific format that wraps each message in special tokens such as <|start_header_id|>, <|end_header_id|>, and <|eot_id|>. Feed it raw text without that structure and the model does not know where the user's question ends and its answer should begin. apply_chat_template applies the correct format automatically.

bitsandbytes (via BitsAndBytesConfig)

This is what enables 4-bit quantization, the technique that lets a 3B parameter model fit on a 15.8GB T4 GPU with room left over for training.

A 3 billion parameter model is 3 billion numbers stored in memory. At full 32-bit precision, each number takes 4 bytes. That puts the weights at approximately 12GB (about 6GB at 16-bit precision) just to load, before a single token has been processed.
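The arithmetic behind those figures is easy to check. The sketch below assumes a rounded count of exactly 3 billion parameters and counts only the weights, ignoring activations and framework overhead; the helper name weight_memory_gb is made up for this illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3_000_000_000  # rounded parameter count, an assumption for illustration

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# 32-bit weights: 12.0 GB
# 16-bit weights: 6.0 GB
#  4-bit weights: 1.5 GB
```

Real memory use is higher than these numbers because of quantization constants, activations, and the CUDA context.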

peft

Parameter-Efficient Fine-Tuning. The library that implements LoRA.

Full fine-tuning updates all 3 billion parameters, which means holding the weights, their gradients, and the optimizer state in memory at the same time: tens of gigabytes of VRAM, well beyond a 15.8GB T4. LoRA instead adds small trainable adapter matrices to specific layers and trains only those, typically around 1% of total parameters depending on the rank and the layers targeted. The original weights stay frozen. This is what makes fine-tuning feasible on consumer hardware.
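To see why the adapters are so small, here is a rough parameter count for a single layer. It assumes a hypothetical 3072 x 3072 projection matrix and a LoRA rank of 16; both numbers are illustrative choices, not settings taken from this project. LoRA replaces the update to a d_out x d_in matrix with two thin matrices of shapes d_out x r and r x d_in.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen weight count, trainable adapter count) for one linear layer."""
    full = d_in * d_out              # original weight matrix, kept frozen
    adapter = rank * (d_in + d_out)  # the two low-rank matrices added by LoRA
    return full, adapter


# Hypothetical layer size and rank, chosen only for illustration.
full, adapter = lora_param_counts(d_in=3072, d_out=3072, rank=16)

print(f"frozen weights:  {full:,}")               # 9,437,184
print(f"adapter weights: {adapter:,}")            # 98,304
print(f"trainable share: {adapter / full:.2%}")   # 1.04%
```

The share for the whole model depends on which layers receive adapters and on the rank chosen.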

trl (SFTTrainer)

Handles the supervised fine-tuning training loop. It manages dataset formatting, gradient accumulation, metric logging, and checkpoint saving. Without it, you write the training loop manually in raw PyTorch, which is correct but tedious.

datasets (Hugging Face)

Standardised interface for loading, filtering, and splitting datasets. Handles streaming large datasets without loading everything into RAM at once. One line to load MedQuAD, one line to split into train and eval.
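As a rough picture of what a train and eval split does, here is a plain-Python sketch. It does not use the datasets library, which provides its own split method; the function name train_eval_split and the placeholder question-answer records are invented for this example.

```python
import random


def train_eval_split(examples, eval_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    eval_size = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder records standing in for real question-answer pairs.
qa_pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(20)]

train, eval_set = train_eval_split(qa_pairs, eval_fraction=0.2)
print(len(train), len(eval_set))  # 16 4
```

Fixing the seed matters here: the eval set has to stay the same across runs, or before-and-after comparisons stop being meaningful.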

accelerate (Hugging Face)

Device management. When you write device_map="auto", Accelerate determines how to distribute the model across available hardware. Mostly called under the hood by other libraries, so you rarely interact with it directly.

Loading the Model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
model.resize_token_embeddings(len(tokenizer))
tokenizer.padding_side = "right"

A 3 billion parameter model is essentially 3 billion numbers stored in memory, and at full precision each one takes 32 bits. Quantization shrinks each number: instead of 32 bits, we store it in 4 bits. Think of fitting a library into a small room. The room stays the same size, but every book is reprinted at one-eighth of its original size, so everything fits.
The tradeoff is a small loss in precision: each stored value is slightly less exact. For most tasks, including medical QA, the quality drop is usually small enough to go unnoticed.
The config uses four settings:

load_in_4bit=True: store weights in 4-bit instead of 32-bit. The 3B model now loads in roughly 2GB.

bnb_4bit_quant_type="nf4": NormalFloat4. A 4-bit format designed for neural network weights, which follow a roughly normal distribution. It preserves more quality than naive 4-bit rounding.

bnb_4bit_compute_dtype=torch.float16: weights are stored in 4-bit, but the actual computation (matrix multiplication) happens in float16. The math cannot be done directly in 4-bit, so weights are dequantized on the fly for each operation while the stored copy stays in 4-bit. Float16 is the middle ground between accuracy and speed.

bnb_4bit_use_double_quant=True: quantize the quantization constants too. Saves a small additional amount of VRAM.
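The store-small, compute-large idea can be illustrated with a toy round trip. This sketch uses simple uniform absmax quantization to signed 4-bit codes, which is not the NF4 scheme bitsandbytes implements; the function names and the sample weights are made up for illustration.

```python
def quantize_absmax_4bit(values):
    """Map floats to integer codes in [-7, 7] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs else 1.0  # avoid dividing by zero for all-zero input
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the stored codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.42, 0.11, 0.0]  # made-up example weights
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

print(codes)  # [7, -4, 1, 0]
for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Only the small integer codes and one scale need to be stored. The restored values are close to the originals but not identical, which is the precision loss described above.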

The Tokenizer

The model cannot read text. It only understands numbers.

Before any text enters the model, it is converted into a sequence of numbers called token IDs. After the model generates numbers as output, something converts them back into readable text. That is the tokenizer's job. It sits at the entry and exit point of the model.

The tokenizer does not split text into letters or whole words. It splits into tokens, which are chunks that could be a full word, part of a word, punctuation, or a special marker.

text = "What causes iron deficiency anemia?"

tokens = tokenizer.tokenize(text)
# ['What', 'Ġcauses', 'Ġiron', 'Ġdeficiency', 'Ġanemia', '?']

ids = tokenizer.encode(text, add_special_tokens=False)  # leave out the <|begin_of_text|> marker
# [3923, 11384, 11245, 32090, 42075, 30]

The model sees [3923, 11384, 11245, 32090, 42075, 30]. Never the actual words.

Longer medical words split into multiple tokens:

tokenizer.tokenize("hypertension")
# ['hyper', 'tension']   two tokens, one word

tokenizer.tokenize("acetaminophen")
# ['acet', 'amin', 'oph', 'en']   four tokens, one word

This is why token count and word count are different numbers, and why models have a context window measured in tokens, not words.
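The gap between word count and token count can be reproduced with a toy splitter. The sketch below uses greedy longest-match over a tiny hand-made vocabulary, which is a simplification and not the byte-pair encoding the Llama tokenizer actually uses; TOY_VOCAB and toy_tokenize are hypothetical names.

```python
# Hand-made vocabulary for illustration only; real vocabularies hold ~100k+ entries.
TOY_VOCAB = {"hyper": 0, "tension": 1, "acet": 2, "amin": 3, "oph": 4, "en": 5}


def toy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest piece first
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return tokens


for word in ("hypertension", "acetaminophen"):
    pieces = toy_tokenize(word, TOY_VOCAB)
    ids = [TOY_VOCAB[p] for p in pieces]
    print(word, pieces, ids)
# hypertension ['hyper', 'tension'] [0, 1]
# acetaminophen ['acet', 'amin', 'oph', 'en'] [2, 3, 4, 5]
```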

tokenizer.add_special_tokens({'pad_token': '<pad>'}): Llama's tokenizer ships without a padding token configured. A common workaround is to reuse the end-of-sequence token for padding, but then the model cannot distinguish between "end of response" and "this is padding." Adding a distinct pad token removes that ambiguity.

model.resize_token_embeddings(len(tokenizer)): the tokenizer's vocabulary just grew by one token. This line grows the model's embedding table to match. Without it, the new pad token's ID would fall outside the embedding table and the lookup would fail.

Running Inference

Part 1: Encoding the Input

def generate_response(prompt):
    messages = [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": prompt}
    ]

    encoded_inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        padding=True
    ).to(model.device)

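apply_chat_template converts the messages list into the format Llama 3.2 was trained on. It is shown here with line breaks added for readability, and the real template may add extra lines to the system block: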

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful medical assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What causes iron deficiency anemia?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

add_generation_prompt=True: adds the final <|start_header_id|>assistant<|end_header_id|> line. Without it, the model has no signal that the user has finished speaking and it is now expected to respond. Use True during inference. Use False during training, where the full assistant response is already in the data.
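To make the effect of add_generation_prompt concrete, here is a hand-rolled approximation of the template. It is a simplified sketch, assuming the Llama 3 header layout shown above; the real template lives in the tokenizer and should be used in practice. format_llama3_chat is a hypothetical helper, not a library function.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Build a Llama-3-style prompt string from a list of role/content dicts."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it is its turn to speak.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What causes iron deficiency anemia?"},
]

print(format_llama3_chat(messages, add_generation_prompt=True))
```

With add_generation_prompt=False the string ends at the user's <|eot_id|>, which is the shape needed for training data that already contains the assistant turn.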

return_tensors="pt": packages the token IDs as a PyTorch tensor. The model cannot consume a plain Python list or a NumPy array. "pt" means PyTorch.

return_dict=True: returns both input_ids and attention_mask as a dictionary instead of just the token IDs tensor.

The attention mask is a companion tensor to input_ids, same length, containing only 1s and 0s. 1 means real token, pay attention to it. 0 means padding, ignore it. Without it, the model treats padding as real content and the output degrades, especially during training.
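A small sketch of how padding and the attention mask line up. It pads on the right and uses plain Python lists instead of tensors; pad_batch is a hypothetical helper and the pad ID of 0 is an arbitrary value chosen for illustration.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


batch = [[3923, 11384, 30], [3923, 30]]  # two prompts of different lengths
ids, mask = pad_batch(batch, pad_id=0)   # pad_id=0 is an illustrative assumption

print(ids)   # [[3923, 11384, 30], [3923, 30, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```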

.to(model.device): moves the tensor from system RAM (CPU memory) to VRAM (GPU memory), where the model lives. Data and model must be in the same memory location. model.device resolves to "cuda:0" on the T4, and works on CPU machines too, unlike hardcoding "cuda".

Part 2: Generating and Decoding

    outputs = model.generate(
        input_ids=encoded_inputs["input_ids"],
        attention_mask=encoded_inputs["attention_mask"],
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id
    )

    input_length = encoded_inputs["input_ids"].shape[-1]
    response = outputs[0][input_length:]
    return tokenizer.decode(response, skip_special_tokens=True)

max_new_tokens=1000: hard ceiling on tokens generated. This counts only new tokens, not the input. Generation stops earlier if the model emits its end-of-turn token. Without an explicit limit, generation falls back to whatever default length the generation config specifies, which is often too short for a full answer.

do_sample=True: sample from the probability distribution instead of always picking the highest-scoring token. Greedy decoding (do_sample=False) is deterministic but tends to produce repetitive, robotic text. Sampling gives more natural prose, at the cost of some run-to-run variation in the answers.

temperature=0.7: controls how sharp or flat the probability distribution is before sampling. Think of it as a confidence dial. At 0.1, the model almost always picks the most predictable next word. At 1.5, it often picks surprising ones. 0.7 sits between them: responses stay grounded, but not so rigid they read like a dictionary entry.

top_p=0.9: nucleus sampling. After temperature reshapes the distribution, top_p cuts off the long tail. It ranks all tokens by probability, sums them from highest to lowest, and stops at 90%. Only tokens within that 90% are considered. The unlikely, potentially nonsensical tokens at the bottom get discarded. Temperature and top_p work together: one reshapes, the other prunes.
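The reshape-then-prune interaction can be sketched in a few lines of plain Python. This is a simplified illustration over four made-up logits, not the implementation used inside model.generate; both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalise the survivors


logits = [2.0, 1.0, 0.0, -3.0]  # made-up scores for four candidate tokens

for temperature in (1.5, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

nucleus = top_p_filter(softmax_with_temperature(logits, 0.7), top_p=0.9)
print(sorted(nucleus))  # indices of the tokens that survive pruning
```

Sampling then draws from the renormalised survivors, so tokens in the discarded tail can never be picked.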

pad_token_id=tokenizer.pad_token_id: explicitly tells the generate function which token represents padding. Removes any remaining ambiguity.

outputs[0][input_length:]: model.generate returns the full sequence, input tokens plus generated tokens concatenated. This slice discards the input and keeps only what the model generated.
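The slice is easier to see with plain lists. The token IDs below are arbitrary placeholders and extract_generated is a hypothetical helper name.

```python
def extract_generated(full_sequence, prompt_length):
    """Drop the echoed prompt tokens and keep only the newly generated ones."""
    return full_sequence[prompt_length:]


prompt_ids = [101, 102, 103]           # placeholder IDs standing in for the encoded prompt
full_output = prompt_ids + [7, 8, 9]   # generate returns the prompt followed by new tokens

print(extract_generated(full_output, len(prompt_ids)))  # [7, 8, 9]
```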

tokenizer.decode(response, skip_special_tokens=True): converts token IDs back to readable text and strips structural tokens like <|eot_id|> and <|end_of_text|> from the output.

Baseline Results

The model was run against five test questions before any fine-tuning. These outputs serve as the benchmark. Every evaluation after training will be compared directly against them.

Q: What are the early symptoms of type 2 diabetes?
A: assistant

As a medical assistant, I'd be happy to help you identify the early symptoms of type 2 diabetes.

Type 2 diabetes is often referred to as "silent diabetes" because it can be asymptomatic for a long time, especially in the early stages. However, some people may experience the following early symptoms:

1. **Increased thirst and hunger**: When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst and hunger.
2. **Frequent urination**: As your body tries to flush out excess glucose, you may need to urinate more frequently, especially at night.
3. **Fatigue**: High blood sugar levels can cause fatigue, which can be mistaken for other conditions.
4. **Blurred vision**: High blood sugar levels can cause the lens in your eye to swell, leading to blurred vision.
5. **Slow healing of cuts and wounds**: High blood sugar levels can impede the healing process, leading to slower recovery from cuts and wounds.
6. **Tingling or numbness in your hands and feet**: High blood sugar levels can damage nerve endings, causing numbness or tingling sensations in your hands and feet.
7. **Recurring skin, gum, or bladder infections**: High blood sugar levels can weaken your immune system, making you more susceptible to infections.
8. **Weight gain or loss**: Some people may experience weight gain or loss due to the changes in their metabolism.
9. **Mood changes**: High blood sugar levels can cause mood swings, anxiety, and depression.

Keep in mind that these symptoms can be subtle and may not always be present. If you're concerned about your risk of developing type 2 diabetes, it's essential to talk to your healthcare provider about your individual risk factors and get regular check-ups.

Remember, early detection and treatment can make a significant difference in managing type 2 diabetes. If you're experiencing any of these symptoms, don't hesitate to consult with your healthcare provider.
============================================================
Q: How is malaria diagnosed and treated?
A: assistant

As a medical assistant, I'd be happy to explain the diagnosis and treatment of malaria.

**Diagnosis:**

Malaria is typically diagnosed using a combination of clinical evaluation, laboratory tests, and rapid diagnostic tests (RDTs). Here are the steps involved in diagnosing malaria:

1. **Clinical evaluation:** A healthcare provider will ask about symptoms, medical history, and travel history to determine if the patient has malaria.
2. **Physical examination:** A physical examination will be performed to look for signs of malaria, such as fever, chills, and flu-like symptoms.
3. **Blood tests:** A blood sample will be taken to check for the presence of malaria parasites in the blood. This can be done using:
    * **Microscopy:** A microscope is used to examine the blood sample under a microscope to look for the parasites.
    * **Rapid diagnostic tests (RDTs):** RDTs are quick and easy-to-use tests that detect the presence of malaria antigens in the blood.
    * **Polymerase chain reaction (PCR):** PCR is a molecular test that detects the genetic material of the malaria parasite.
4. **Confirmatory testing:** If the initial tests are positive, further testing may be done to confirm the diagnosis.

**Treatment:**

Treatment for malaria typically involves antimalarial medications. The type and dosage of medication will depend on the type of malaria, the patient's age, weight, and medical history, as well as any underlying medical conditions.

**Common antimalarial medications:**

1. **Artemisinin-based combination therapies (ACTs):** These are the most commonly used medications for treating malaria. Examples include artesunate and mefloquine.
2. **Quinine:** This is an older medication that is still used to treat malaria, particularly in areas where ACTs are not available.
3. **Clindamycin:** This antibiotic is used to treat malaria in combination with other medications.

**Treatment guidelines:**

1. **Adults:** Treatment for malaria in adults typically involves a 3-day course of ACTs.
2. **Children:** Treatment for malaria in children typically involves a 3-day course of ACTs or a 7-day course of mefloquine.
3. **Pregnant women:** Treatment for malaria in pregnant women typically involves a 3-day course of ACTs.

**Prevention:**

Preventing malaria involves a combination of measures, including:

1. **Vaccination:** The RTS,S vaccine is available for people living in areas with high malaria transmission.
2. **Insecticide-treated bed nets:** Using bed nets treated with insecticides can help prevent mosquito bites.
3. **Personal protective equipment (PPE):** Wearing PPE, such as long-sleeved shirts and pants, can help prevent mosquito bites.
4. **Screened windows and doors:** Installing screened windows and doors can help prevent mosquito bites.
5. **Eliminating standing water:** Eliminating standing water around homes and communities can help prevent mosquito breeding.

It's essential to note that malaria diagnosis and treatment should only be done under the guidance of a qualified healthcare provider.
============================================================

What is working. Responses are medically coherent, well structured, and consistently include appropriate caveats like "consult a healthcare provider." The malaria answer is particularly strong. It correctly distinguishes between microscopy, RDTs, and PCR for diagnosis, and covers ACT treatment guidelines accurately.

What fine-tuning should improve. The model frequently opens with "As a medical assistant, I'd be happy to help you..." That is a filler pattern from instruction tuning, not genuine clinical reasoning. Fine-tuning on MedQuAD should shift responses toward direct, factual answers without the preamble.

One clear failure. The diabetes answer contains a factual error: it attributes increased thirst to insulin causing water retention. The correct mechanism is high blood glucose pulling fluid from tissues through osmosis. This is the hallucination described at the top of this article, superficial keyword correlation producing a plausible-sounding but clinically incorrect causal chain. It is documented here because it is exactly what fine-tuning on curated medical data is meant to fix. If the fine-tuned model gets this right, that is a meaningful result.

What's Next

Week 1 is done. The environment is confirmed, the model is loaded, and the baseline is saved to the repo.

Week 2 is data preparation: loading MedQuAD, inspecting the schema, formatting the dataset into the instruction template Llama expects, and setting up train and eval splits. That is where the actual fine-tuning pipeline begins.

The repo is public: https://github.com/nicholas-ugbala-dev/healthcare-llm-finetune.git

Introduction to Machine Learning for Everyone!

Nicholas (Kosisochukwu) Ugbala — Tue, 25 Mar 2025 08:14:53 +0000

Machine Learning: Supervised and Unsupervised Learning

As defined by Arthur Samuel, machine learning is a field of study that allows computers to learn without being explicitly programmed. This enables computers to make predictions and decisions based on data.

Supervised Learning

Supervised learning refers to algorithms that learn from labeled data, mapping input (X) to output (Y). This means the algorithm is provided with examples to learn from before making predictions. Supervised learning is categorized into two main types: regression and classification.

Regression

Regression is used when the goal is to predict a continuous number from infinitely many possible values.

Example:

House Price Prediction: Given features such as the size of a house, location, and number of rooms, a regression model can predict the price of the house.
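A minimal sketch of regression with one feature, using ordinary least squares to fit a straight line. The house sizes and prices are invented numbers that lie exactly on a line, and fit_line is a hypothetical helper; real data would be noisier and would use more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented training data: size in square metres and sale price.
sizes = [50, 80, 100, 120]
prices = [70_000, 100_000, 120_000, 140_000]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # price estimate for an unseen 90 square metre house
print(round(predicted))  # 110000
```

The output is a number on a continuous scale, which is what separates regression from classification.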

Classification

Classification, on the other hand, is used to predict categories, where the possible outputs are finite and limited.

Example:

Disease Diagnosis: Identifying whether a patient has a particular disease based on symptoms and test results (e.g., whether a breast tumor is benign or malignant).
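A minimal sketch of classification using the nearest-neighbor rule: a new case receives the label of the most similar labeled example. The two-number feature vectors are invented toy values, not real clinical measurements, and predict_nearest_neighbor is a hypothetical helper.

```python
import math


def predict_nearest_neighbor(train_points, train_labels, query):
    """Return the label of the training point closest to the query."""
    distances = [math.dist(point, query) for point in train_points]
    nearest_index = distances.index(min(distances))
    return train_labels[nearest_index]


# Invented toy features for four labeled cases.
features = [(1.0, 1.2), (1.5, 0.9), (4.0, 4.2), (4.5, 3.8)]
labels = ["benign", "benign", "malignant", "malignant"]

print(predict_nearest_neighbor(features, labels, (1.2, 1.0)))  # benign
print(predict_nearest_neighbor(features, labels, (4.2, 4.0)))  # malignant
```

The output is one of a fixed set of categories, never a value in between.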

Difference Between Regression and Classification

Regression	Classification
Predicts a continuous number	Predicts categories
Infinitely many possible outputs	Small, limited set of possible outputs
Example: Predicting house prices	Example: Identifying breast cancer

Unsupervised Learning

Unlike supervised learning, unsupervised learning works with data that has no labels. The algorithm must find patterns and structure in the data on its own. A key example of unsupervised learning is clustering.

Clustering

Clustering algorithms group data points into clusters based on similarities.

Examples:

Customer Segmentation: E-commerce companies use clustering to group customers based on purchasing behavior.
Social Media Analysis: Identifying communities of users with similar interests.
Genetic Research: Grouping genes with similar expressions for medical research.
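As a minimal sketch of clustering, here is k-means on a single feature. The monthly spend figures and the starting centroids are invented for illustration, and kmeans_1d is a hypothetical helper. No labels are supplied; the groups emerge from the data alone.

```python
def kmeans_1d(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


monthly_spend = [12, 15, 14, 95, 102, 98]  # invented customer spend values

centroids, clusters = kmeans_1d(monthly_spend, initial_centroids=[12, 102])
print(clusters)  # [[12, 15, 14], [95, 102, 98]]
print([round(c, 1) for c in centroids])
```

The algorithm separates low spenders from high spenders without ever being told those categories exist.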

Other Types of Unsupervised Learning

Anomaly Detection: Used to detect fraud or unusual activity, such as spotting suspicious transactions in financial data.
Dimensionality Reduction: Compresses data using fewer numbers, making it easier to analyze while preserving its essential structure.
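Anomaly detection can be sketched with a simple z-score rule: flag any value that sits far from the mean, measured in standard deviations. The transaction amounts and the threshold of 2.0 are invented for illustration, and find_anomalies is a hypothetical helper; production fraud systems use far more robust methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


amounts = [20, 22, 19, 21, 23, 20, 500]  # invented transaction amounts

print(find_anomalies(amounts))  # [500]
```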

We will explore these types of supervised and unsupervised learning in detail in upcoming discussions. Machine learning is an expansive and captivating field with applications across nearly every industry. Stay tuned as we delve deeper into its concepts and real-world implementations!
