Akhilesh

How I built a medical dataset pipeline for LLM fine-tuning

Before you train a model, you need data in the right format. This took me longer than I expected and taught me a lot about how LLMs actually learn.

The dataset

I used MedQA USMLE: multiple-choice questions from the medical licensing exams used to certify doctors in the US. It's available for free on Hugging Face.

from datasets import load_dataset
# Downloads the train and test splits from the Hugging Face Hub
dataset = load_dataset("GBaker/MedQA-USMLE-4-options")

Each sample looks like this:

question: A 23-year-old pregnant woman with burning urination...
options: {A: Ampicillin, B: Ceftriaxone, C: Doxycycline, D: Nitrofurantoin}
answer: Nitrofurantoin
answer_idx: D

Total: 10,178 training questions, 1,273 test questions.

The problem: raw data isn't training data

Raw multiple-choice records are not a useful training signal for a chat-style model on their own. For fine-tuning, each example is usually written as a conversation in an instruction format: you tell the model who it is, give it a question, and show it the correct response.

This is called instruction tuning. It is the stage, applied after pretraining on raw text, that turns models like ChatGPT, Claude, and most modern chat LLMs into assistants that follow instructions.

Converting to instruction format

Every sample needs to become this structure:

<s>[INST] <<SYS>>
You are MedMind, an expert clinical AI...
<</SYS>>

Clinical Question: [question]
Options: A: ... B: ... C: ... D: ...
What is the best answer and why? [/INST]

Let me analyze this step by step.
The correct answer is D: Nitrofurantoin
Clinical Reasoning: ... </s>

I wrote a script that converts every raw question into this format automatically.
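
The conversion step could look roughly like the sketch below. It is not the original script. It assumes each sample is a dict with the `question`, `options`, `answer`, and `answer_idx` fields shown earlier, and it reproduces the template above, which resembles the Llama 2 chat layout. The names `format_options` and `to_instruction_format` are hypothetical, and the system prompt and reasoning text are passed in by the caller because the post elides both.

```python
def format_options(options: dict) -> str:
    """Render the options dict as 'A: ... B: ... C: ... D: ...'."""
    return " ".join(f"{key}: {text}" for key, text in sorted(options.items()))


def to_instruction_format(sample: dict, system_prompt: str, reasoning: str) -> str:
    """Wrap one MedQA-style sample in the [INST] / <<SYS>> template."""
    prompt = (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"Clinical Question: {sample['question']}\n"
        f"Options: {format_options(sample['options'])}\n"
        "What is the best answer and why? [/INST]\n\n"
    )
    response = (
        "Let me analyze this step by step.\n"
        f"The correct answer is {sample['answer_idx']}: {sample['answer']}\n"
        f"Clinical Reasoning: {reasoning} </s>"
    )
    return prompt + response


# Sample copied from the post; the elided parts ("...") are left as they appear there.
sample = {
    "question": "A 23-year-old pregnant woman with burning urination...",
    "options": {"A": "Ampicillin", "B": "Ceftriaxone",
                "C": "Doxycycline", "D": "Nitrofurantoin"},
    "answer": "Nitrofurantoin",
    "answer_idx": "D",
}
text = to_instruction_format(
    sample,
    system_prompt="You are MedMind, an expert clinical AI...",
    reasoning="...",
)
print(text)
```

Keeping the prompt and the response as two separate strings makes it easy to later train only on the response part, if the training setup supports that.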

Cleaning the data

After conversion I ran validation:

  • Answer distribution: A=25.4%, B=26.1%, C=25.1%, D=23.4% — nearly perfect balance
  • Found 2 duplicate questions — removed
  • Found 2 questions over 600 words — removed (too long for training)
  • Final clean dataset: 10,174 samples, 25.8MB
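
The duplicate and length filters above can be sketched as follows. This is an illustration under assumptions, not the original validation code: it treats two questions as duplicates when they match after lowercasing and collapsing whitespace, and it counts words by splitting on whitespace. The function name `clean_samples` and the `max_words` parameter are hypothetical; the 600-word limit comes from the list above.

```python
def clean_samples(samples: list, max_words: int = 600) -> list:
    """Drop duplicate questions and questions longer than max_words."""
    seen = set()
    cleaned = []
    for sample in samples:
        # Normalize case and whitespace so trivially different copies match.
        words = sample["question"].lower().split()
        key = " ".join(words)
        if key in seen:
            continue  # duplicate of a question we already kept or rejected
        seen.add(key)
        if len(words) > max_words:
            continue  # over the word limit
        cleaned.append(sample)
    return cleaned


demo = [
    {"question": "Which nerve is affected?"},
    {"question": "Which  nerve is AFFECTED?"},   # duplicate after normalization
    {"question": "word " * 601},                 # 601 words, over the limit
    {"question": "Which artery is affected?"},
]
print(len(clean_samples(demo)))  # prints 2
```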

The balanced answer distribution matters: if 80% of the answers were "D", the model could learn to pick D by default instead of reasoning about the question.
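
A balance check like this takes only a few lines. The sketch below assumes each sample carries an `answer_idx` letter, as in the raw data. The helper names `answer_distribution` and `is_balanced`, and the 5-point tolerance, are hypothetical choices for illustration.

```python
from collections import Counter


def answer_distribution(samples: list) -> dict:
    """Return the percentage of samples whose correct answer is each letter."""
    counts = Counter(sample["answer_idx"] for sample in samples)
    total = sum(counts.values())
    return {letter: round(100 * n / total, 1) for letter, n in sorted(counts.items())}


def is_balanced(dist: dict, letters=("A", "B", "C", "D"), tolerance: float = 5.0) -> bool:
    """True if every letter's share is within `tolerance` points of a uniform split."""
    expected = 100 / len(letters)
    return all(abs(dist.get(letter, 0.0) - expected) <= tolerance for letter in letters)


demo = [{"answer_idx": "A"}, {"answer_idx": "B"}, {"answer_idx": "D"}, {"answer_idx": "D"}]
print(answer_distribution(demo))  # {'A': 25.0, 'B': 25.0, 'D': 50.0}

# The distribution reported above passes the check; a skewed one does not.
print(is_balanced({"A": 25.4, "B": 26.1, "C": 25.1, "D": 23.4}))  # True
print(is_balanced({"A": 5.0, "B": 5.0, "C": 10.0, "D": 80.0}))    # False
```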

What I learned

Data preparation is underrated. Most ML tutorials skip straight to training, but the quality of your data determines the quality of your model more than any hyperparameter.
