DEV Community

TildAlice

Posted on • Originally published at tildalice.io

LoRA vs QLoRA vs Full Fine-tuning: GPU Memory Benchmarks

Why This Matters: $5/hour vs $0.50/hour

Full fine-tuning a 7B-parameter model on AWS costs around $5/hour on a single A100. LoRA drops that to under $1/hour on a T4. QLoRA? $0.50/hour, sometimes less.

But here's the catch: lower cost usually means lower quality. The question is how much quality you're trading away, and whether it actually matters for your use case. I spent a week running the same fine-tuning job three different ways — full fine-tuning, LoRA, and QLoRA — to see where the GPU memory really goes and what you get for the price difference.

The results weren't what I expected. QLoRA matched full fine-tuning accuracy on my task, while LoRA fell short. That shouldn't happen according to the papers, but it did.

Arduino and LoRa components set up on a breadboard for a DIY project.

Photo by Bmonster Lab on Pexels

The Memory Breakdown Nobody Shows You

Most comparisons just tell you "LoRA uses less memory." Cool, but where does the memory actually go during training? I instrumented a Llama-2 7B fine-tune to track peak memory at each stage:

Full fine-tuning (7B model, batch size 4, sequence length 512):
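The instrumented numbers continue in the full article, but you can sanity-check the orders of magnitude yourself. Here's a back-of-envelope estimator for the model-state memory (weights + gradients + optimizer; activations and KV cache excluded) — a sketch, not the article's measurements. The per-parameter byte counts assume mixed-precision AdamW for full fine-tuning, a frozen fp16 base for LoRA, and an NF4 base for QLoRA; the 0.5% adapter fraction is an illustrative assumption:

```python
def training_memory_gb(n_params_b, method="full"):
    """Rough GPU memory for model states only (weights + grads +
    optimizer), ignoring activations and KV cache."""
    p = n_params_b * 1e9
    GB = 1024 ** 3
    if method == "full":
        # fp16 weights (2 B) + fp16 grads (2 B) + AdamW fp32 master
        # weights, momentum, variance (4 + 4 + 4 = 12 B)
        return p * (2 + 2 + 12) / GB
    # Adapters (assumed ~0.5% of params) are trained like "full"
    adapter = 0.005 * p * (2 + 2 + 12)
    if method == "lora":
        base = p * 2      # frozen fp16 base
    elif method == "qlora":
        base = p * 0.5    # NF4: 4 bits/param, quantization-constant
                          # overhead ignored here
    return (base + adapter) / GB

for m in ("full", "lora", "qlora"):
    print(f"{m:>6}: ~{training_memory_gb(7, m):.0f} GB")
```

For a 7B model this lands around ~100 GB for full fine-tuning versus low-teens for LoRA and single digits for QLoRA — consistent with why one needs an A100 and the others fit on a T4 or a consumer card.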


Continue reading the full article on TildAlice

Top comments (1)

AutoJanitor

Great timing on this — we just finished a QLoRA fine-tune and these benchmarks match our real-world experience almost exactly.

We trained "SophiaCore" — a Qwen2.5-7B with identity and tool-calling baked in — using SFT (32,466 examples, 1 epoch, ~69 hours) followed by DPO (2,012 pairs, 2 epochs, ~7 hours). All done with QLoRA on consumer hardware. The finding that QLoRA matched full fine-tuning accuracy on your task is something we can corroborate: our Q4_K_M quantized model scores 98-99/100 on identity tests and 35/36 on capability regression tests — matching stock Qwen2.5 on everything except the new behaviors we added.

The $5/hour vs $0.50/hour framing is the right way to think about this. Our total training cost was effectively zero because we ran it on lab hardware (RTX 5070 12GB + RTX 4070 8GB), but the time difference was dramatic. QLoRA let us fit a 7B model comfortably in 12GB VRAM with room for batch size 4. Full fine-tuning would have required our V100 32GB or the POWER8's 512GB RAM — neither of which had the right software stack for training at the time.

One thing I'd add to your analysis: the quality gap between LoRA and QLoRA that you noticed (QLoRA winning) might be because NF4 quantization preserves the pre-trained weight distribution better than LoRA's low-rank approximation during the adapter merge step. We saw something similar — our SFT adapter merged cleanly with the NF4 base, but a LoRA-only experiment had subtle regression on complex reasoning tasks.
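The intuition behind this is checkable with a toy experiment: on a Gaussian-ish weight matrix, 4-bit round-off is a much smaller relative error than what a truncated low-rank approximation of the same matrix leaves behind. A minimal numpy sketch (uniform absmax quantization standing in for NF4's quantile bins, and SVD truncation standing in for what a rank-r factor can express — illustrative stand-ins, not the actual bitsandbytes kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(256, 256))  # Gaussian-ish weight matrix

def quant4(w):
    """Symmetric 4-bit absmax quantization: 15 levels in -7..7,
    a crude stand-in for NF4's 16 quantile bins."""
    s = np.abs(w).max()
    return np.round(w / s * 7) / 7 * s

def lowrank(w, k=16):
    """Best rank-k approximation via truncated SVD — an upper bound
    on what a rank-k factorization of this matrix can capture."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]

err_q = np.linalg.norm(W - quant4(W)) / np.linalg.norm(W)
err_lr = np.linalg.norm(W - lowrank(W)) / np.linalg.norm(W)
print(f"4-bit quantization error: {err_q:.2f}")
print(f"rank-16 approx error:     {err_lr:.2f}")
```

Quantization error comes out well under low-rank error here, which matches the intuition that an NF4 base keeps nearly all of the pre-trained weights while a low-rank factor, by construction, can only represent a thin slice of them.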

Would love to see your memory breakdown extended to the DPO/RLHF stage. That's where memory usage gets truly unpredictable — reference model + policy model + value head all competing for VRAM simultaneously.