How I built a medical dataset pipeline for LLM fine-tuning

#ai #python #beginners #productivity

Before you train a model, you need data in the right format. Getting it there took me longer than I expected and taught me a lot about how LLMs actually learn.

The dataset

I used MedQA USMLE, a set of multiple-choice questions drawn from US medical licensing exam material (the USMLE is the exam series doctors must pass to practice in the US). It's freely available on the Hugging Face Hub.

from datasets import load_dataset
dataset = load_dataset("GBaker/MedQA-USMLE-4-options")  # downloaded from the Hugging Face Hub on first call, then cached locally

Each sample looks like this:

question: A 23-year-old pregnant woman with burning urination...
options: {A: Ampicillin, B: Ceftriaxone, C: Doxycycline, D: Nitrofurantoin}
answer: Nitrofurantoin
answer_idx: D

Total: 10,178 training questions, 1,273 test questions.
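
Since each record stores the answer twice (as text and as a letter), a quick sanity check is to confirm the two agree. Here is a minimal sketch, assuming each sample is a plain dict with the four fields shown above. The function name is_consistent is made up for illustration, and the sample is hard-coded from the example rather than loaded from the Hub.

```python
# Hypothetical sample mirroring the fields shown above (question text truncated as in the post).
sample = {
    "question": "A 23-year-old pregnant woman with burning urination...",
    "options": {
        "A": "Ampicillin",
        "B": "Ceftriaxone",
        "C": "Doxycycline",
        "D": "Nitrofurantoin",
    },
    "answer": "Nitrofurantoin",
    "answer_idx": "D",
}


def is_consistent(sample: dict) -> bool:
    """Return True if answer_idx points at an option whose text equals answer."""
    options = sample.get("options", {})
    idx = sample.get("answer_idx")
    return idx in options and options[idx] == sample.get("answer")


print(is_consistent(sample))  # True
```

Any record that fails this check is worth inspecting by hand before it goes anywhere near training.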

The problem: raw data isn't training data

You can't just feed raw multiple-choice records to a model and expect useful fine-tuning. Chat-style fine-tuning expects each example as a conversation: a system prompt that tells the model who it is, a user turn containing the question, and the response you want the model to produce.

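This is called instruction tuning (or supervised fine-tuning). It is one of the stages, applied after pretraining on raw text, that turns models like ChatGPT, Claude, and most modern chat LLMs into assistants that follow instructions.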

Converting to instruction format

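Every sample needs to become this structure, which follows the Llama 2 chat prompt template (the [INST] and <<SYS>> markers). Other model families use different markers: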

<s>[INST] <<SYS>>
You are MedMind, an expert clinical AI...
<</SYS>>

Clinical Question: [question]
Options: A: ... B: ... C: ... D: ...
What is the best answer and why? [/INST]

Let me analyze this step by step.
The correct answer is D: Nitrofurantoin
Clinical Reasoning: ... </s>

I wrote a script that converts every raw question into this format automatically.
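
A minimal sketch of such a converter, assuming each sample is a dict with the question, options, answer, and answer_idx fields shown earlier. The function name to_instruction_format and the reasoning parameter are hypothetical: the template needs clinical reasoning text, and this sketch leaves it to the caller to supply. The system prompt is kept truncated exactly as it appears above.

```python
# Truncated exactly as in the template above; replace with the full prompt you use.
SYSTEM_PROMPT = "You are MedMind, an expert clinical AI..."


def to_instruction_format(sample: dict, reasoning: str) -> str:
    """Render one MedQA-style sample into the [INST]/<<SYS>> template.

    `reasoning` is a hypothetical parameter: the explanation text has to
    come from somewhere, and this sketch makes the caller provide it.
    """
    # Sort by letter so the options always appear in A, B, C, D order.
    options = " ".join(f"{k}: {v}" for k, v in sorted(sample["options"].items()))
    prompt = (
        f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
        f"Clinical Question: {sample['question']}\n"
        f"Options: {options}\n"
        "What is the best answer and why? [/INST]\n\n"
    )
    response = (
        "Let me analyze this step by step.\n"
        f"The correct answer is {sample['answer_idx']}: {sample['answer']}\n"
        f"Clinical Reasoning: {reasoning} </s>"
    )
    return prompt + response


# Hypothetical sample built from the example record shown earlier.
sample = {
    "question": "A 23-year-old pregnant woman with burning urination...",
    "options": {"A": "Ampicillin", "B": "Ceftriaxone", "C": "Doxycycline", "D": "Nitrofurantoin"},
    "answer": "Nitrofurantoin",
    "answer_idx": "D",
}
text = to_instruction_format(sample, reasoning="(reasoning text goes here)")
print(text)
```

One thing to watch: the <s> and </s> markers are written here as literal text. Many tokenizers also add their own start-of-sequence token, so it's worth checking the tokenized output for a doubled marker.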

Cleaning the data

After conversion I ran validation:

Answer distribution: A=25.4%, B=26.1%, C=25.1%, D=23.4% — nearly perfect balance
Found 2 duplicate questions — removed
Found 2 questions over 600 words — removed (too long for training)
Final clean dataset: 10,174 samples, 25.8MB
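
The duplicate and length checks can be sketched as a single pass over the samples. This is a minimal sketch under stated assumptions: duplicates are detected by comparing question text after lowercasing and collapsing whitespace (exact matching would also be a reasonable choice), length is measured in whitespace-separated words of the question only, and the function name clean is made up for illustration.

```python
def clean(samples: list[dict], max_words: int = 600) -> list[dict]:
    """Drop repeated questions and questions longer than max_words words."""
    seen = set()
    kept = []
    for s in samples:
        # Normalise case and whitespace so trivially different copies still match.
        key = " ".join(s["question"].lower().split())
        if key in seen:
            continue  # duplicate of an earlier question
        if len(s["question"].split()) > max_words:
            continue  # over the word limit
        seen.add(key)
        kept.append(s)
    return kept


# Tiny hypothetical input: one duplicate and one over-long question.
raw = [
    {"question": "What is X?"},
    {"question": "what is  x?"},
    {"question": "word " * 601},
    {"question": "What is Y?"},
]
cleaned = clean(raw)
print(len(cleaned))  # 2
```

The first occurrence of a duplicated question is kept and later copies are dropped, so the result doesn't depend on anything except input order.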

A balanced answer distribution matters: if 80% of the answers were "D", the model could score well during training just by guessing D, and it would learn that shortcut instead of the medicine.
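
Checking the balance takes only a few lines. Below is a minimal sketch, assuming each sample carries an answer_idx letter as in the dataset. The names answer_distribution and is_balanced, and the 5-point tolerance, are hypothetical choices for illustration, not a standard threshold.

```python
from collections import Counter


def answer_distribution(samples, letters="ABCD"):
    """Return {letter: percentage of samples with that answer_idx}, rounded to 0.1."""
    counts = Counter(s["answer_idx"] for s in samples)
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in letters}
    # Counter returns 0 for letters that never occur, so missing letters show up as 0.0.
    return {letter: round(100 * counts[letter] / total, 1) for letter in letters}


def is_balanced(dist, tolerance=5.0):
    """True if every share is within `tolerance` percentage points of a uniform split."""
    if not dist:
        return False
    expected = 100 / len(dist)
    return all(abs(share - expected) <= tolerance for share in dist.values())


# The distribution reported above passes; a skewed one does not.
print(is_balanced({"A": 25.4, "B": 26.1, "C": 25.1, "D": 23.4}))  # True
print(is_balanced({"A": 5.0, "B": 5.0, "C": 10.0, "D": 80.0}))    # False
```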

What I learned

Data preparation is underrated. Most ML tutorials skip straight to training, but in my experience the quality of your data often shapes the quality of your model more than hyperparameter tuning does.