<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nicholas (Kosisochukwu) Ugbala</title>
    <description>The latest articles on DEV Community by Nicholas (Kosisochukwu) Ugbala (@nicholas-ugbala-dev).</description>
    <link>https://dev.to/nicholas-ugbala-dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2955401%2F9d138ab9-ac94-4800-8935-790ee4bc81fb.jpeg</url>
      <title>DEV Community: Nicholas (Kosisochukwu) Ugbala</title>
      <link>https://dev.to/nicholas-ugbala-dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nicholas-ugbala-dev"/>
    <language>en</language>
    <item>
      <title>Fine-Tuning Llama 3.2 3B on Medical QA: Week 2- Data Preparation</title>
      <dc:creator>Nicholas (Kosisochukwu) Ugbala</dc:creator>
      <pubDate>Wed, 27 May 2026 08:11:36 +0000</pubDate>
      <link>https://dev.to/nicholas-ugbala-dev/fine-tuning-llama-32-3b-on-medical-qa-week-2-data-preparation-5812</link>
      <guid>https://dev.to/nicholas-ugbala-dev/fine-tuning-llama-32-3b-on-medical-qa-week-2-data-preparation-5812</guid>
      <description>&lt;h2&gt;
  
  
  What Happened This Week
&lt;/h2&gt;

&lt;p&gt;Week 1 established the baseline. This week is where the actual engineering begins.&lt;/p&gt;

&lt;p&gt;Before any fine-tuning can happen, the training data has to be in the exact format the model expects. That sounds simple. It is not. This week involved loading a 112K-row medical dataset, discovering it was the wrong dataset for the goal, switching to a different dataset, building a cleaning pipeline, and formatting everything into the Llama 3.2 chat template. Every step had a decision worth explaining.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wrong Dataset
&lt;/h2&gt;

&lt;p&gt;The initial plan was to use &lt;code&gt;lavita/medical-qa-datasets&lt;/code&gt; with the &lt;code&gt;medical_meadow_medqa&lt;/code&gt; subset. Loading it and inspecting the samples revealed a problem I initially ignored.&lt;br&gt;
The outputs looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OUTPUT: D: Trimethoprim-sulfamethoxazole
OUTPUT: A: The most important risk factors are hypertension and diabetes
OUTPUT: E: Pneumovax
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are answer selections, not clinical explanations. The dataset is USMLE multiple-choice questions. Training on this would produce a model that selects answer letters from five options, which is not the goal. The goal is a model that answers clinical questions in clear, factual prose.&lt;/p&gt;

&lt;p&gt;The dataset was correct in provenance (NIH-sourced, board-exam quality) but incorrect in shape. Switching the dataset was the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Right Dataset
&lt;/h2&gt;

&lt;p&gt;ChatDoctor HealthCareMagic 100K (&lt;code&gt;lavita/ChatDoctor-HealthCareMagic-100K&lt;/code&gt;) is 112,165 real patient questions with doctor responses in prose format. Output looks like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Fibrotic scarring in the right apical region of the lung may be due to past infection like tuberculosis. Fibrosis is a healed stage and generally does not require treatment. You may need to follow up with a chest physician for monitoring."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the output style the fine-tuned model should produce. Conversational, factual, direct.&lt;/p&gt;

&lt;p&gt;The tradeoff: this is real forum data, not curated clinical text. Quality varies. Some responses are excellent clinical reasoning. Others are vague. The engineering problem for Week 2 was building a cleaning pipeline that keeps the signal and removes the noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cleaning Pipeline
&lt;/h2&gt;

&lt;p&gt;Loading the raw dataset and inspecting samples revealed four specific problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform filler in outputs&lt;/strong&gt;. Every response opens with noise that the model will learn and replicate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Hello, welcome to Chat Doctor..."
"Thanks for using Chat Doctor..."
"Hi Dear, Welcome to Chat Doctor..."
"and I hope I can help you today..."
"Thank you for posting your query..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If these survive into training data, the fine-tuned model will learn to open every response the same way. That will be way worse than the base model's filler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trailing sign-offs&lt;/strong&gt;. Outputs ended with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"...Best wishes, Chat Doctor."
"...I hope this helps."
"...Take care."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same problem with the filler starters for the output. These are social conventions from a forum platform, not clinical reasoning patterns worth learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform name artifacts in inputs.&lt;/strong&gt; Some patients' input contained platform name mid-sentence, leaked in through copy-paste errors during data collection. Training on these teaches the model that "ChatDoctor" is a meaningful clinical term.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output Quality Variance.&lt;/strong&gt; Some outputs were too short to contain useful clinical content, fewer than 30 words. Some sequences were too long for the T4's VRAM budget when tokenized.&lt;/p&gt;

&lt;p&gt;The cleaning function strips filler from both ends of every output using regex patterns, and the filter function rejects rows that failed quality thresholds. A second pass removed any samples where platform artifacts survived cleaning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def clean_output(text):
    filler_starts = [
        r'^Hello[\w\s,]*Welcome to Chat\s?Doctor[.\s]*',
        r'^and I hope I can help you today\.?[\s]*',
        r'^Thank you for (posting|consulting|writing|using)[\w\s,]*[.\s]+',
        r'^Hello[\s,]+',
        r'^Hi[\s,]+',
        r'^Dear[\w\s,]+,[\s]*',
    ]
    for pattern in filler_starts:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE).strip()

    filler_ends = [
        r'[,.]?\s*Best wishes,?\s*Chat\s?Doctor\.?$',
        r'[,.]?\s*I hope (this|it) helps?\.?$',
        r'[,.]?\s*Take care\.?$',
    ]
    for pattern in filler_ends:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE).strip()

    return text

def is_clean(sample):
    if re.search(r'chatdoctor', sample['input'], re.IGNORECASE):
        return False
    if len(sample['output'].split()) &amp;lt; 30:
        return False
    if len(sample['input'].split()) + len(sample['output'].split()) &amp;gt; 600:
        return False
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 112,165 rows cleaned to 45,205. About 60% of the data was removed. It may seem like we now have less data to work with or to help improve our model. The point here is, A model trained on 45K clean samples will outperform one trained on 112K noisy set.&lt;/p&gt;

&lt;p&gt;10,000 rows were then sampled randomly with &lt;code&gt;seed=42&lt;/code&gt; for reproducibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Formatting Into the Llama Chat Template
&lt;/h2&gt;

&lt;p&gt;The model was trained on a specific conversation format. Feeding it data in any other structure will produce a corrupted training signal because the model does not know which tokens are the user's questions and which are the assistant's answers.&lt;/p&gt;

&lt;p&gt;Every sample was converted from this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "instruction": "If you are a doctor, please answer the medical questions...",
    "input": "I have been having sharp chest pain on my left side...",
    "output": "Sharp chest pain that worsens with deep breathing..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Into this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;|begin_of_text|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;system&amp;lt;|end_header_id|&amp;gt;
If you are a doctor, please answer the medical questions based on the patient's description.
&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
I have been having sharp chest pain on my left side for two days...
&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;
Sharp chest pain that worsens with deep breathing is often pleuritic in nature...
&amp;lt;|eot_id|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;instruction&lt;/code&gt; field is identical across 112K rows. It is a static task description, not a per-sample instruction. It belongs in the system prompt, which is a model's standing brief about its role. The patient's question goes in the &lt;code&gt;user turn&lt;/code&gt;. The doctor's response goes in the &lt;code&gt;assistant's turn&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;add_generation_prompt=False&lt;/code&gt; during training, because the full assistant response is already in the data. The model is learning to produce that response, not being asked to generate it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Token Length Distrubtion
&lt;/h2&gt;

&lt;p&gt;Before finalising the dataset, the token length of every formatted samples was measured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shortest:    78 tokens
Longest:     794 tokens
Average:     261 tokens
Over 512:    110 samples (1.1%)
Over 1024:   0 samples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_seq_length = 512&lt;/code&gt; was chosen for training. Only 1.1% of samples exceed it, so truncation loss is negligible. Using 512 instead of 1024 means less VRAM per sequence, faster training, and the effective batch sizes on the T4.&lt;/p&gt;




&lt;h2&gt;
  
  
  Train and Eval Split
&lt;/h2&gt;

&lt;p&gt;The 10K formatted samples were slit 90/10:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;split = formatted.train_test_split(test_size=0.1, seed=42)
# Train: 9,000 samples
# Eval:  1,000 samples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I considered an 80/10/10 three-way split but ultimately decided not to go through with it. In fine-tuning, a separate test set adds little value. The model is not making architectural decisions based on held-out results. The eval set monitors training loss. The real qualitative test is the five baseline questions from Week 1 run through the fine-tuned model after training.&lt;br&gt;
The cleaned dataset is published publicly on Hugging Face Hub for full reproducibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why ChatDoctor and Not Something Better
&lt;/h2&gt;

&lt;p&gt;ChatDoctor is not the highest-quality medical dataset available. PubMedQA has a better clinical provenance. Augmented MedQA with chain-of-thought reasoning would produce stronger results. A GPT-4 synthesised dataset from medical textbooks would be cleaner.&lt;/p&gt;

&lt;p&gt;ChatDoctor was chosen for three specific reasons. First, the output format matches the goal: conversational prose responses to patient-described symptoms. PubMedQA produces yes/no research answers, not clinical explanations. MedQA is multiple choice. Neither matches the target output style. Second, it is publicly available, ungated, and immediately loadable without preprocessing overhead. Augmented chain-of-thought versions of MedQA do not exist as clean public datasets and would require GPT-4 generation to create, introducing a proprietary dependency. Third, the cleaning problem is real and representative: building a pipeline that filters 112K noisy forum rows to 45K usable samples is closer to production data engineering than loading a pre-sanitised benchmark. For a project demonstrating the full fine-tuning pipeline, that tradeoff is deliberate.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Week 2 is done. The dataset is clean, formatted, split, and is live on Hugging Face.&lt;/p&gt;

&lt;p&gt;Week 3 is the first LoRA fine-tuning run: configuring PEFT, setting up SFTTrainer, running training on the T4, and comparing the fine-tuned model's outputs against the Week 1 baseline. That is where the project works or reveals what needs fixing.&lt;br&gt;
&lt;a href="https://huggingface.co/datasets/nicholas-ugbala-hf/chatdoctor-cleaned-10k" rel="noopener noreferrer"&gt;Cleaned dataset&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/nicholas-ugbala-dev/healthcare-llm-finetune" rel="noopener noreferrer"&gt;healthcare-llm-finetune Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Fine-Tuning Llama 3.2 3B on Medical QA: Week 1 Setup and Baseline Inference</title>
      <dc:creator>Nicholas (Kosisochukwu) Ugbala</dc:creator>
      <pubDate>Tue, 19 May 2026 09:56:18 +0000</pubDate>
      <link>https://dev.to/nicholas-ugbala-dev/fine-tuning-llama-32-3b-on-medical-qa-week-1-setup-and-baseline-inference-3k25</link>
      <guid>https://dev.to/nicholas-ugbala-dev/fine-tuning-llama-32-3b-on-medical-qa-week-1-setup-and-baseline-inference-3k25</guid>
      <description>&lt;h2&gt;
  
  
  The Problem With General-Purpose LLMs in Healthcare
&lt;/h2&gt;

&lt;p&gt;Ask a general-purpose LLM about the early symptoms of type 2 diabetes and it might tell you:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That is wrong. Increased thirst in diabetes is caused by high blood glucose pulling fluid from tissues, not insulin. The model arrived at that answer through superficial pattern matching: insulin, fluid retention, and thirst are highly correlated keywords in general web text, so the model linked them causally. In a general conversation, that kind of error is a minor annoyance. In a clinical context, it erodes patient trust and can lead to incorrect self-triage.&lt;br&gt;
This is the core problem with using general-purpose LLMs in healthcare: they are medically coherent on the surface but clinically unreliable underneath. Fine-tuning on a curated medical dataset penalizes these superficial token correlations and forces the model to adhere to verified clinical pathways.&lt;br&gt;
That is what this project is about.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Am Building and Why
&lt;/h2&gt;

&lt;p&gt;I am fine-tuning Llama 3.2 3B Instruct on a medical question-answering dataset, deploying it as a public inference API, and documenting the full pipeline from dataset choice to evaluation to deployment.&lt;br&gt;
The goal is to learn the complete fine-tuning pipeline: data preparation, LoRA training, evaluation, and deployment. Healthcare is the domain for this project, but the pipeline is domain-agnostic. Every decision made here applies equally to legal QA, customer support, or any other vertical that needs a specialised language model.&lt;/p&gt;

&lt;p&gt;By the end of the project, I will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A public GitHub repo with training code and reproducibility instructions&lt;/li&gt;
&lt;li&gt;A fine-tuned model checkpoint on Hugging Face Hub&lt;/li&gt;
&lt;li&gt;A live FastAPI inference endpoint&lt;/li&gt;
&lt;li&gt;A technical write-up of every decision made and why&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Model and Dataset Choice
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Base Model: Llama 3.2 3B Instruct
&lt;/h3&gt;

&lt;p&gt;The 3B sits in the right spot for the constraints of this project. It is capable enough to produce meaningful answers on medical QA even before fine-tuning, so the baseline is not embarrassing and the demo is not hollow. It is small enough to train on a free-tier T4 GPU with QLoRA. And it has the richest fine-tuning ecosystem of any open model family.&lt;/p&gt;
&lt;h3&gt;
  
  
  Dataset: MedQuAD (via lavita/medical-qa-datasets on Hugging Face)
&lt;/h3&gt;

&lt;p&gt;MedQuAD is sourced from the USMLE, the United States Medical Licensing Examination. These are board exam questions written and validated by medical professionals. The data is clean, authoritative, and NIH-sourced. Provenance matters in healthcare AI. Using forum-scraped data might produce a more conversational model, but you cannot defend the quality of the training data. With MedQuAD, you can.&lt;/p&gt;

&lt;p&gt;An honest caveat: It is worth acknowledging that the USMLE questions reflect US clinical guidelines; it is not a dealbreaker for a portfolio project demonstrating the fine-tuning pipeline. A production system deployed globally would require localisation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Infrastructure and Deployment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Training Compute:&lt;/strong&gt; Google Colab (NVIDIA T4 GPU, 15.8GB VRAM). Colab is a cloud-hosted notebook environment that provisions GPU compute on demand. Think of it as a rented virtual machine with a GPU attached, accessible entirely through the browser. It is the right tool for a training run at this scale: no local GPU required, no cloud provisioning overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Hosting:&lt;/strong&gt; Hugging Face Hub. The GitHub of ML models. Model weights, versioned checkpoints, and model cards all live here publicly.&lt;br&gt;
Inference API: FastAPI. A lightweight Python web framework for wrapping the model in an HTTP endpoint.&lt;br&gt;
&lt;strong&gt;Containerisation:&lt;/strong&gt; Docker. The FastAPI inference server is containerised, so the deployment is reproducible.&lt;br&gt;
&lt;strong&gt;Fallback GPU:&lt;/strong&gt; RunPod or Lambda Labs, if the free Colab tier hits limits on the larger training run in Week 4.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Library Stack
&lt;/h2&gt;
&lt;h3&gt;
  
  
  torch (PyTorch)
&lt;/h3&gt;

&lt;p&gt;The foundation on which everything else runs. PyTorch is a numerical computation engine. It handles tensors (multidimensional arrays), moves data between CPU and GPU, and executes the mathematical operations that make the model run. Every other library in this stack is built on top of it.&lt;/p&gt;

&lt;p&gt;Why PyTorch and not TensorFlow? The open-source LLM ecosystem, Meta, Mistral, Qwen, ships in PyTorch.&lt;/p&gt;
&lt;h3&gt;
  
  
  transformers (Hugging Face)
&lt;/h3&gt;

&lt;p&gt;The library that knows how to talk to a specific model. Every model architecture has its own structure: different layer names, tokenization logic, and chat templates. &lt;code&gt;transformers&lt;/code&gt; abstracts all of that.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AutoModelForCausalLM&lt;/code&gt;: "Auto" means inspect the model config, determine the architecture, and load the correct class automatically. "ForCausalLM" means load it in text-generation mode (predict the next token), not classification mode.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AutoTokenizer&lt;/code&gt;: converts raw text into token IDs the model understands, and converts the model's output IDs back into readable text.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;apply_chat_template&lt;/code&gt;: Llama 3.2 was trained on a specific format using &lt;code&gt;&amp;lt;|system|&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;|user|&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;|assistant|&amp;gt;&lt;/code&gt; tags. Feed it raw text without that structure and the model does not know where the user's question ends and its answer should begin. &lt;code&gt;apply_chat_template&lt;/code&gt; applies the correct format automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  bitsandbytes (via BitsAndBytesConfig)
&lt;/h3&gt;

&lt;p&gt;This is what enables 4-bit quantization, the technique that makes a 3B parameter model fit on a 15.8GB T4 GPU.&lt;/p&gt;

&lt;p&gt;A 3 billion parameter model is 3 billion numbers stored in memory. By default, each number takes 32 bits of space. That puts the raw model at approximately 12GB just to load, before a single token has been processed.&lt;/p&gt;
&lt;h3&gt;
  
  
  peft
&lt;/h3&gt;

&lt;p&gt;Parameter-Efficient Fine-Tuning. The library that implements LoRA.&lt;/p&gt;

&lt;p&gt;Training all 3 billion parameters from scratch would require roughly 24GB of VRAM and hours of compute. LoRA instead adds small trainable adapter matrices to specific layers and trains only those, approximately 1 to 5% of total parameters. The original weights stay frozen. This is what makes fine-tuning feasible on consumer hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  trl (SFTTrainer)
&lt;/h3&gt;

&lt;p&gt;Handles the supervised fine-tuning training loop. It manages dataset formatting, gradient accumulation, metric logging, and checkpoint saving. Without it, you write the training loop manually in raw PyTorch, which is correct but tedious.&lt;/p&gt;
&lt;h3&gt;
  
  
  datasets (Hugging Face)
&lt;/h3&gt;

&lt;p&gt;Standardised interface for loading, filtering, and splitting datasets. Handles streaming large datasets without loading everything into RAM at once. One line to load MedQuAD, one line to split into train and eval.&lt;/p&gt;
&lt;h3&gt;
  
  
  accelerate (Hugging Face)
&lt;/h3&gt;

&lt;p&gt;Device management. When you write &lt;code&gt;device_map="auto"&lt;/code&gt;, Accelerate determines how to distribute the model across available hardware. Mostly called under the hood by other libraries, so you rarely interact with it directly.&lt;/p&gt;


&lt;h2&gt;
  
  
  Loading the Model
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3.2-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_special_tokens&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pad_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;pad&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize_token_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;padding_side&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A 3 billion parameter model is essentially 3 billion numbers stored in memory. By default, each number takes up 32 bits of space. That is a lot: imagine trying to fit a library into a small room.&lt;br&gt;
Quantization is the process of shrinking each number to take up less space. Instead of 32 bits per number, we store it in 4 bits. The room is the same size, but now the books are smaller, so everything fits. For context, it reduces the books to a size of very tiny stickers; the real size will be a microscopic book.&lt;br&gt;
The tradeoff is a tiny loss in precision; the numbers are slightly less exact. But for most tasks, including medical QA, the model performs well enough that you would not notice the difference. &lt;br&gt;
Quantization shrinks each parameter. The config specifically uses:&lt;br&gt;
&lt;code&gt;load_in_4bit=True&lt;/code&gt;: store weights in 4-bit instead of 32-bit. Now the 3B model loads in ~2GB.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bnb_4bit_quant_type=”nf4”&lt;/code&gt;: NormalFloat4. A smarter 4-bit format designed specifically for neural network weights, which follow a roughly normal distribution. Better quality than a naive 4-bit rounding.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bnb_4bit_compute_dtype=torch.float16&lt;/code&gt;: even though weights are stored in 4-bit, actual computation (matrix multiplication) happens in float16, we can’t do math in 4-bit, we decompress on the fly, compute, then recompress. Float16 is the middle ground for accuracy and speed.&lt;br&gt;
&lt;code&gt;bnb_4bit_use_double_quant=True&lt;/code&gt;: quantize the quantization constants too. Saves a small additional amount of VRAM.  &lt;/p&gt;
&lt;h2&gt;
  
  
  The Tokenizer
&lt;/h2&gt;

&lt;p&gt;The model cannot read text. It only understands numbers.&lt;/p&gt;

&lt;p&gt;Before any text enters the model, it is converted into a sequence of numbers called token IDs. After the model generates numbers as output, something converts them back into readable text. That is the tokenizer's job. It sits at the entry and exit point of the model.&lt;/p&gt;

&lt;p&gt;The tokenizer does not split text into letters or whole words. It splits into tokens, which are chunks that could be a full word, part of a word, punctuation, or a special marker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What causes iron deficiency anemia?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ['What', 'Ġcauses', 'Ġiron', 'Ġdeficiency', 'Ġanemia', '?']
&lt;/span&gt;
&lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# [3923, 11384, 11245, 32090, 42075, 30]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model sees &lt;code&gt;[3923, 11384, 11245, 32090, 42075, 30]&lt;/code&gt;. Never the actual words.&lt;/p&gt;

&lt;p&gt;Longer medical words split into multiple tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hypertension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ['hyper', 'tension']   two tokens, one word
&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acetaminophen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ['acet', 'amin', 'oph', 'en']   four tokens, one word
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why token count and word count are different numbers, and why models have a context window measured in tokens, not words.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tokenizer.add_special_tokens({'pad_token': '&amp;lt;pad&amp;gt;'})&lt;/code&gt;: Llama's tokenizer has no dedicated padding token. Without one, the tokenizer reuses the end-of-sequence token for padding, which means the model cannot distinguish between "end of response" and "this is padding." Adding a distinct pad token removes that ambiguity.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;model.resize_token_embeddings(len(tokenizer))&lt;/code&gt;: the model's vocabulary just grew by one token. This line tells the model about it. Without it, the model would not know how to handle the new pad token and would crash.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running Inference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Part 1: Encoding the Input
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful medical assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;encoded_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_dict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;apply_chat_template&lt;/code&gt; converts the messages list into the exact string format Llama 3.2 was trained on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;|begin_of_text|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;system&amp;lt;|end_header_id|&amp;gt;
You are a helpful medical assistant.
&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
What causes iron deficiency anemia?
&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;add_generation_prompt=True&lt;/code&gt;&lt;/strong&gt;: adds the final &lt;code&gt;&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;&lt;/code&gt; line. Without it, the model has no signal that the user has finished speaking and it is now expected to respond. Use &lt;code&gt;True&lt;/code&gt; during inference. Use &lt;code&gt;False&lt;/code&gt; during training, where the full assistant response is already in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;return_tensors="pt"&lt;/code&gt;&lt;/strong&gt;: packages the token IDs as a PyTorch tensor. The model cannot consume a plain Python list or a NumPy array. &lt;code&gt;"pt"&lt;/code&gt; means PyTorch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;return_dict=True&lt;/code&gt;&lt;/strong&gt;: returns both &lt;code&gt;input_ids&lt;/code&gt; and &lt;code&gt;attention_mask&lt;/code&gt; as a dictionary instead of just the token IDs tensor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The attention mask&lt;/strong&gt; is a companion tensor to &lt;code&gt;input_ids&lt;/code&gt;, same length, containing only 1s and 0s. 1 means real token, pay attention to it. 0 means padding, ignore it. Without it, the model treats padding as real content and the output degrades, especially during training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.to(model.device)&lt;/code&gt;&lt;/strong&gt;: moves the tensor from system RAM (CPU memory) to VRAM (GPU memory), where the model lives. Data and model must be in the same memory location.&lt;code&gt;model.device&lt;/code&gt; resolves to &lt;code&gt;"cuda:0"&lt;/code&gt; on the T4, and works on CPU machines too, unlike hardcoding &lt;code&gt;"cuda"&lt;/code&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 2: Generating and Decoding
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;encoded_inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;encoded_inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pad_token_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token_id&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;input_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoded_inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;input_length&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;max_new_tokens=1000&lt;/code&gt;&lt;/strong&gt;: hard ceiling on tokens generated. This counts only new tokens, not the input. Without it the model generates indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;do_sample=True&lt;/code&gt;&lt;/strong&gt;: sample from the probability distribution instead of always picking the highest-scoring token. Deterministic generation (&lt;code&gt;False&lt;/code&gt;) produces repetitive, robotic text. For medical QA, we want coherent prose, so sampling gives that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;temperature=0.7&lt;/code&gt;&lt;/strong&gt;: controls how sharp or flat the probability distribution is before sampling. Think of it as a confidence dial. At 0.1, the model always picks the most predictable next word. At 1.5, it picks surprising ones. 0.7 sits between them: responses are grounded and accurate, but not so rigid they read like a dictionary entry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;top_p=0.9&lt;/code&gt;&lt;/strong&gt;: nucleus sampling. After temperature reshapes the distribution, &lt;code&gt;top_p&lt;/code&gt; cuts off the long tail. It ranks all tokens by probability, sums them from highest to lowest, and stops at 90%. Only tokens within that 90% are considered. The unlikely, potentially nonsensical tokens at the bottom get discarded. Temperature and &lt;code&gt;top_p&lt;/code&gt; work together: one reshapes, the other prunes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pad_token_id=tokenizer.pad_token_id&lt;/code&gt;&lt;/strong&gt;: explicitly tells the generate function which token represents padding. Removes any remaining ambiguity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;outputs[0][input_length:]&lt;/code&gt;&lt;/strong&gt;: &lt;code&gt;model.generate&lt;/code&gt; returns the full sequence, input tokens plus generated tokens concatenated. This slice discards the input and keeps only what the model generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;tokenizer.decode(response, skip_special_tokens=True)&lt;/code&gt;&lt;/strong&gt;: converts token IDs back to readable text and strips structural tokens like &lt;code&gt;&amp;lt;|eot_id|&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;|end_of_text|&amp;gt;&lt;/code&gt; from the output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Baseline Results
&lt;/h2&gt;

&lt;p&gt;The model was run against five test questions before any fine-tuning. These outputs serve as the benchmark. Every evaluation after training will be compared directly against them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q: What are the early symptoms of type 2 diabetes?)
A: assistant

As a medical assistant, I'd be happy to help you identify the early symptoms of type 2 diabetes.

Type 2 diabetes is often referred to as "silent diabetes" because it can be asymptomatic for a long time, especially in the early stages. However, some people may experience the following early symptoms:

1. **Increased thirst and hunger**: When your body produces more insulin, it can cause your body to hold onto more water, leading to increased thirst and hunger.
2. **Frequent urination**: As your body tries to flush out excess glucose, you may need to urinate more frequently, especially at night.
3. **Fatigue**: High blood sugar levels can cause fatigue, which can be mistaken for other conditions.
4. **Blurred vision**: High blood sugar levels can cause the lens in your eye to swell, leading to blurred vision.
5. **Slow healing of cuts and wounds**: High blood sugar levels can impede the healing process, leading to slower recovery from cuts and wounds.
6. **Tingling or numbness in your hands and feet**: High blood sugar levels can damage nerve endings, causing numbness or tingling sensations in your hands and feet.
7. **Recurring skin, gum, or bladder infections**: High blood sugar levels can weaken your immune system, making you more susceptible to infections.
8. **Weight gain or loss**: Some people may experience weight gain or loss due to the changes in their metabolism.
9. **Mood changes**: High blood sugar levels can cause mood swings, anxiety, and depression.

Keep in mind that these symptoms can be subtle and may not always be present. If you're concerned about your risk of developing type 2 diabetes, it's essential to talk to your healthcare provider about your individual risk factors and get regular check-ups.

Remember, early detection and treatment can make a significant difference in managing type 2 diabetes. If you're experiencing any of these symptoms, don't hesitate to consult with your healthcare provider.
============================================================
Q: How is malaria diagnosed and treated?)
A: assistant

As a medical assistant, I'd be happy to explain the diagnosis and treatment of malaria.

**Diagnosis:**

Malaria is typically diagnosed using a combination of clinical evaluation, laboratory tests, and rapid diagnostic tests (RDTs). Here are the steps involved in diagnosing malaria:

1. **Clinical evaluation:** A healthcare provider will ask about symptoms, medical history, and travel history to determine if the patient has malaria.
2. **Physical examination:** A physical examination will be performed to look for signs of malaria, such as fever, chills, and flu-like symptoms.
3. **Blood tests:** A blood sample will be taken to check for the presence of malaria parasites in the blood. This can be done using:
    * **Microscopy:** A microscope is used to examine the blood sample under a microscope to look for the parasites.
    * **Rapid diagnostic tests (RDTs):** RDTs are quick and easy-to-use tests that detect the presence of malaria antigens in the blood.
    * **Polymerase chain reaction (PCR):** PCR is a molecular test that detects the genetic material of the malaria parasite.
4. **Confirmatory testing:** If the initial tests are positive, further testing may be done to confirm the diagnosis.

**Treatment:**

Treatment for malaria typically involves antimalarial medications. The type and dosage of medication will depend on the type of malaria, the patient's age, weight, and medical history, as well as any underlying medical conditions.

**Common antimalarial medications:**

1. **Artemisinin-based combination therapies (ACTs):** These are the most commonly used medications for treating malaria. Examples include artesunate and mefloquine.
2. **Quinine:** This is an older medication that is still used to treat malaria, particularly in areas where ACTs are not available.
3. **Clindamycin:** This antibiotic is used to treat malaria in combination with other medications.

**Treatment guidelines:**

1. **Adults:** Treatment for malaria in adults typically involves a 3-day course of ACTs.
2. **Children:** Treatment for malaria in children typically involves a 3-day course of ACTs or a 7-day course of mefloquine.
3. **Pregnant women:** Treatment for malaria in pregnant women typically involves a 3-day course of ACTs.

**Prevention:**

Preventing malaria involves a combination of measures, including:

1. **Vaccination:** The RTS,S vaccine is available for people living in areas with high malaria transmission.
2. **Insecticide-treated bed nets:** Using bed nets treated with insecticides can help prevent mosquito bites.
3. **Personal protective equipment (PPE):** Wearing PPE, such as long-sleeved shirts and pants, can help prevent mosquito bites.
4. **Screened windows and doors:** Installing screened windows and doors can help prevent mosquito bites.
5. **Eliminating standing water:** Eliminating standing water around homes and communities can help prevent mosquito breeding.

It's essential to note that malaria diagnosis and treatment should only be done under the guidance of a qualified healthcare provider.
============================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is working.&lt;/strong&gt; Responses are medically coherent, well structured, and consistently include appropriate caveats like "consult a healthcare provider." The malaria answer is particularly strong. It correctly distinguishes between microscopy, RDTs, and PCR for diagnosis, and covers ACT treatment guidelines accurately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fine-tuning should improve.&lt;/strong&gt; The model frequently opens with "As a medical assistant, I'd be happy to help you..." That is a filler pattern from instruction tuning, not genuine clinical reasoning. Fine-tuning on MedQuAD should shift responses toward direct, factual answers without the preamble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One clear failure.&lt;/strong&gt; The diabetes answer contains a factual error: it attributes increased thirst to insulin causing water retention. The correct mechanism is high blood glucose pulling fluid from tissues through osmosis. This is the hallucination described at the top of this article, superficial keyword correlation producing a plausible-sounding but clinically incorrect causal chain. It is documented here because it is exactly what fine-tuning on curated medical data is meant to fix. If the fine-tuned model gets this right, that is a meaningful result.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Week 1 is done. The environment is confirmed, the model is loaded, and the baseline is saved to the repo.&lt;/p&gt;

&lt;p&gt;Week 2 is data preparation: loading MedQuAD, inspecting the schema, formatting the dataset into the instruction template Llama expects, and setting up train and eval splits. That is where the actual fine-tuning pipeline begins.&lt;/p&gt;

&lt;p&gt;The repo is public: [&lt;a href="https://github.com/nicholas-ugbala-dev/healthcare-llm-finetune.git" rel="noopener noreferrer"&gt;https://github.com/nicholas-ugbala-dev/healthcare-llm-finetune.git&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introduction to Machine Learning for Everyone!</title>
      <dc:creator>Nicholas (Kosisochukwu) Ugbala</dc:creator>
      <pubDate>Tue, 25 Mar 2025 08:14:53 +0000</pubDate>
      <link>https://dev.to/nicholas-ugbala-dev/introduction-to-machine-learning-for-everyone-5d5f</link>
      <guid>https://dev.to/nicholas-ugbala-dev/introduction-to-machine-learning-for-everyone-5d5f</guid>
      <description>&lt;h2&gt;
  
  
  Machine Learning: Supervised and Unsupervised Learning
&lt;/h2&gt;

&lt;p&gt;As defined by Arthur Samuel, machine learning is a field of study that allows computers to learn without being explicitly programmed. This enables computers to make predictions and decisions based on data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supervised Learning
&lt;/h2&gt;

&lt;p&gt;Supervised learning refers to algorithms that learn from labeled data, mapping input (X) to output (Y). This means the algorithm is provided with examples to learn from before making predictions. Supervised learning is categorized into two main types: regression and classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regression
&lt;/h3&gt;

&lt;p&gt;Regression is used when the goal is to predict a continuous number from infinitely many possible values.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu1q7erslvucslcqlesp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu1q7erslvucslcqlesp.png" alt="House pricing prediction" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;House Price Prediction&lt;/strong&gt;: Given features such as the size of a house, location, and number of rooms, a regression model can predict the price of the house.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;Classification, on the other hand, is used to predict categories, where the possible outputs are finite and limited.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6mblh8hm94cht2oj8fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6mblh8hm94cht2oj8fz.png" alt="Breast cancer detection" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Disease Diagnosis&lt;/strong&gt;: Identifying whether a patient has a particular disease based on symptoms and test results (e.g., Breast cancer is benign or malignant).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Difference Between Regression and Classification
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Regression&lt;/th&gt;
&lt;th&gt;Classification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Predicts a continuous number&lt;/td&gt;
&lt;td&gt;Predicts categories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infinitely many possible outputs&lt;/td&gt;
&lt;td&gt;Small, limited set of possible outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example: Predicting house prices&lt;/td&gt;
&lt;td&gt;Example: Identifying breast cancer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Unsupervised Learning
&lt;/h2&gt;

&lt;p&gt;Unlike supervised learning, unsupervised learning works with data that has no labels. The algorithm must find patterns and structure in the data on its own. A key example of unsupervised learning is clustering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clustering
&lt;/h3&gt;

&lt;p&gt;Clustering algorithms group data points into clusters based on similarities.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer Segmentation&lt;/strong&gt;: E-commerce companies use clustering to group customers based on purchasing behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social Media Analysis&lt;/strong&gt;: Identifying communities of users with similar interests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Genetic Research&lt;/strong&gt;: Grouping genes with similar expressions for medical research.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Other Types of Unsupervised Learning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly Detection&lt;/strong&gt;: Used to detect fraud or unusual activity, such as spotting suspicious transactions in financial data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimensionality Reduction&lt;/strong&gt;: Compresses data using fewer numbers, making it easier to analyze while preserving its essential structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will explore these types of supervised and unsupervised learning in detail in upcoming discussions. Machine learning is an expansive and captivating field with applications across nearly every industry. Stay tuned as we delve deeper into its concepts and real-world implementations!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
